# Lesson 1: Introduction to Data Science

Hello there! In this lesson, we'll explore some basic data processing techniques using Python.

First, let's load the data sets. We will use two: Amazon Reviews as an example of unstructured data, and Titanic Passengers as an example of structured data.

The Python Data Analysis Library (pandas) is a great tool for organizing our data. Here, we will load the data into [Data Frames](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html), which are tabular data structures.

Note that, for most of the examples here, there are several ways to achieve the same results (e.g., using numpy, scipy, statistics, and other libraries). It is worth checking the documentation and community forums to see which approach is the most efficient and appropriate for your specific problem.

In [None]:
# Load datasets
import pandas as pd
import csv

# Amazon Review Data
url_amazon = 'https://drive.google.com/file/d/1MR1OKC6eimuRNUr8Z7z9oIJ98kyFsT7k/view?usp=share_link'
url_amazon = 'https://drive.google.com/uc?id=' + url_amazon.split('/')[-2]
df_amazon = pd.read_csv(url_amazon,sep='\t', names=['label','text'], quoting=csv.QUOTE_NONE)

# Titanic Passengers Data
url_titanic = 'https://drive.google.com/file/d/1gq9zHF_uZrmb4Tr3iskDVcgw7jKMaRG4/view?usp=share_link'
url_titanic = 'https://drive.google.com/uc?id=' + url_titanic.split('/')[-2]
df_titanic = pd.read_csv(url_titanic, sep=',')

## A. Unstructured Data

The Amazon Review data set contains two features:
* `label`: `__label__1` for 1- or 2-star reviews; `__label__2` for 4- or 5-star reviews
* `text`: actual review

Since we're dealing with raw text, we need to preprocess the data before performing any analysis.

In [None]:
# Amazon Review Data
print("Amazon Review Data (samples):")
df_amazon.head(10).style.set_properties(**{'text-align': 'left'})

Amazon Review Data (samples):


Unnamed: 0,label,text
0,__label__2,"Great CD: My lovely Pat has one of the GREAT voices of her generation. I have listened to this CD for YEARS and I still LOVE IT. When I'm in a good mood it makes me feel better. A bad mood just evaporates like sugar in the rain. This CD just oozes LIFE. Vocals are jusat STUUNNING and lyrics just kill. One of life's hidden gems. This is a desert isle CD in my book. Why she never made it big is just beyond me. Everytime I play this, no matter black, white, young, old, male, female EVERYBODY says one thing ""Who was that singing ?"""
1,__label__2,"One of the best game music soundtracks - for a game I didn't really play: Despite the fact that I have only played a small portion of the game, the music I heard (plus the connection to Chrono Trigger which was great as well) led me to purchase the soundtrack, and it remains one of my favorite albums. There is an incredible mix of fun, epic, and emotional songs. Those sad and beautiful tracks I especially like, as there's not too many of those kinds of songs in my other video game soundtracks. I must admit that one of the songs (Life-A Distant Promise) has brought tears to my eyes on many occasions.My one complaint about this soundtrack is that they use guitar fretting effects in many of the songs, which I find distracting. But even if those weren't included I would still consider the collection worth it."
2,__label__1,"Batteries died within a year ...: I bought this charger in Jul 2003 and it worked OK for a while. The design is nice and convenient. However, after about a year, the batteries would not hold a charge. Might as well just get alkaline disposables, or look elsewhere for a charger that comes with batteries that have better staying power."
3,__label__2,"works fine, but Maha Energy is better: Check out Maha Energy's website. Their Powerex MH-C204F charger works in 100 minutes for rapid charge, with option for slower charge (better for batteries). And they have 2200 mAh batteries."
4,__label__2,"Great for the non-audiophile: Reviewed quite a bit of the combo players and was hesitant due to unfavorable reviews and size of machines. I am weaning off my VHS collection, but don't want to replace them with DVD's. This unit is well built, easy to setup and resolution and special effects (no progressive scan for HDTV owners) suitable for many people looking for a versatile product.Cons- No universal remote."
5,__label__1,"DVD Player crapped out after one year: I also began having the incorrect disc problems that I've read about on here. The VCR still works, but hte DVD side is useless. I understand that DVD players sometimes just quit on you, but after not even one year? To me that's a sign on bad quality. I'm giving up JVC after this as well. I'm sticking to Sony or giving another brand a shot."
6,__label__1,"Incorrect Disc: I love the style of this, but after a couple years, the DVD is giving me problems. It doesn't even work anymore and I use my broken PS2 Now. I wouldn't recommend this, I'm just going to upgrade to a recorder now. I wish it would work but I guess i'm giving up on JVC. I really did like this one... before it stopped working. The dvd player gave me problems probably after a year of having it."
7,__label__1,"DVD menu select problems: I cannot scroll through a DVD menu that is set up vertically. The triangle keys will only select horizontally. So I cannot select anything on most DVD's besides play. No special features, no language select, nothing, just play."
8,__label__2,"Unique Weird Orientalia from the 1930's: Exotic tales of the Orient from the 1930's. ""Dr Shen Fu"", a Weird Tales magazine reprint, is about the elixir of life that grants immortality at a price. If you're tired of modern authors who all sound alike, this is the antidote for you. Owen's palette is loaded with splashes of Chinese and Japanese colours. Marvelous."
9,__label__1,"Not an ""ultimate guide"": Firstly,I enjoyed the format and tone of the book (how the author addressed the reader). However, I did not feel that she imparted any insider secrets that the book promised to reveal. If you are just starting to research law school, and do not know all the requirements of admission, then this book may be a tremendous help. If you have done your homework and are looking for an edge when it comes to admissions, I recommend some more topic-specific books. For example, books on how to write your personal statment, books geared specifically towards LSAT preparation (Powerscore books were the most helpful for me), and there are some websites with great advice geared towards aiding the individuals whom you are asking to write letters of recommendation. Yet, for those new to the entire affair, this book can definitely clarify the requirements for you."


### Counting words
Counting the number of words/phrases in the texts could tell us a few things. 

*How long are Amazon reviews typically?* 
*Are bad reviews longer than good reviews?*

In [None]:
# Count number of words
print("Review word count:")
df_amazon['count'] = df_amazon['text'].str.split().str.len()
df_amazon.head(10).style

Review word count:


Unnamed: 0,label,text,count
0,__label__2,"Great CD: My lovely Pat has one of the GREAT voices of her generation. I have listened to this CD for YEARS and I still LOVE IT. When I'm in a good mood it makes me feel better. A bad mood just evaporates like sugar in the rain. This CD just oozes LIFE. Vocals are jusat STUUNNING and lyrics just kill. One of life's hidden gems. This is a desert isle CD in my book. Why she never made it big is just beyond me. Everytime I play this, no matter black, white, young, old, male, female EVERYBODY says one thing ""Who was that singing ?""",106
1,__label__2,"One of the best game music soundtracks - for a game I didn't really play: Despite the fact that I have only played a small portion of the game, the music I heard (plus the connection to Chrono Trigger which was great as well) led me to purchase the soundtrack, and it remains one of my favorite albums. There is an incredible mix of fun, epic, and emotional songs. Those sad and beautiful tracks I especially like, as there's not too many of those kinds of songs in my other video game soundtracks. I must admit that one of the songs (Life-A Distant Promise) has brought tears to my eyes on many occasions.My one complaint about this soundtrack is that they use guitar fretting effects in many of the songs, which I find distracting. But even if those weren't included I would still consider the collection worth it.",148
2,__label__1,"Batteries died within a year ...: I bought this charger in Jul 2003 and it worked OK for a while. The design is nice and convenient. However, after about a year, the batteries would not hold a charge. Might as well just get alkaline disposables, or look elsewhere for a charger that comes with batteries that have better staying power.",60
3,__label__2,"works fine, but Maha Energy is better: Check out Maha Energy's website. Their Powerex MH-C204F charger works in 100 minutes for rapid charge, with option for slower charge (better for batteries). And they have 2200 mAh batteries.",37
4,__label__2,"Great for the non-audiophile: Reviewed quite a bit of the combo players and was hesitant due to unfavorable reviews and size of machines. I am weaning off my VHS collection, but don't want to replace them with DVD's. This unit is well built, easy to setup and resolution and special effects (no progressive scan for HDTV owners) suitable for many people looking for a versatile product.Cons- No universal remote.",69
5,__label__1,"DVD Player crapped out after one year: I also began having the incorrect disc problems that I've read about on here. The VCR still works, but hte DVD side is useless. I understand that DVD players sometimes just quit on you, but after not even one year? To me that's a sign on bad quality. I'm giving up JVC after this as well. I'm sticking to Sony or giving another brand a shot.",73
6,__label__1,"Incorrect Disc: I love the style of this, but after a couple years, the DVD is giving me problems. It doesn't even work anymore and I use my broken PS2 Now. I wouldn't recommend this, I'm just going to upgrade to a recorder now. I wish it would work but I guess i'm giving up on JVC. I really did like this one... before it stopped working. The dvd player gave me problems probably after a year of having it.",80
7,__label__1,"DVD menu select problems: I cannot scroll through a DVD menu that is set up vertically. The triangle keys will only select horizontally. So I cannot select anything on most DVD's besides play. No special features, no language select, nothing, just play.",42
8,__label__2,"Unique Weird Orientalia from the 1930's: Exotic tales of the Orient from the 1930's. ""Dr Shen Fu"", a Weird Tales magazine reprint, is about the elixir of life that grants immortality at a price. If you're tired of modern authors who all sound alike, this is the antidote for you. Owen's palette is loaded with splashes of Chinese and Japanese colours. Marvelous.",62
9,__label__1,"Not an ""ultimate guide"": Firstly,I enjoyed the format and tone of the book (how the author addressed the reader). However, I did not feel that she imparted any insider secrets that the book promised to reveal. If you are just starting to research law school, and do not know all the requirements of admission, then this book may be a tremendous help. If you have done your homework and are looking for an edge when it comes to admissions, I recommend some more topic-specific books. For example, books on how to write your personal statment, books geared specifically towards LSAT preparation (Powerscore books were the most helpful for me), and there are some websites with great advice geared towards aiding the individuals whom you are asking to write letters of recommendation. Yet, for those new to the entire affair, this book can definitely clarify the requirements for you.",148


Here we can see that:
* Average (mean) review length is ~80 words
* Shortest review contained 14 words
* Longest review reached 204 words

In [None]:
# Describe simple statistics of 'count' variable
df_amazon['count'].describe()

count    10000.000000
mean        79.867900
std         43.368444
min         14.000000
25%         43.000000
50%         72.000000
75%        110.000000
max        204.000000
Name: count, dtype: float64
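
To get at the second question (*are bad reviews longer than good reviews?*), one option is to group the word counts by label. The sketch below uses a tiny made-up data frame, `toy_reviews`, as a stand-in for `df_amazon`, so the numbers it prints say nothing about the real data; the same `groupby` call should work on `df_amazon` once the `count` column exists.

```python
import pandas as pd

# Toy stand-in for df_amazon; these texts and labels are made up for illustration
toy_reviews = pd.DataFrame({
    'label': ['__label__1', '__label__2', '__label__1', '__label__2'],
    'text': [
        'Stopped working after a year',
        'Great sound and easy to set up',
        'Not worth the money at all',
        'Love it',
    ],
})

# Same word count as before: split on whitespace, then count the tokens
toy_reviews['count'] = toy_reviews['text'].str.split().str.len()

# Mean and median word count for each label
length_by_label = toy_reviews.groupby('label')['count'].agg(['mean', 'median'])
print(length_by_label)
```

Comparing the median alongside the mean is a reasonable habit here, since a few very long reviews can inflate the mean.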

### Counting word frequency

Counting the number of times (frequency) each word appeared across all reviews is another valuable piece of information we can extract from the unstructured data. It would be interesting to associate certain words with positive or negative reviews.

In [None]:
# Count word frequency
print("Most commonly used words in reviews:")
word_freq = df_amazon['text'].str.lower().str.split().explode().value_counts()
word_freq[:10].to_frame().style

Most commonly used words in reviews:


Unnamed: 0,text
the,40420
and,21435
a,20507
i,19590
to,18908
of,16932
this,14726
is,14333
it,12831
in,9568


Notice that the most frequent words ("the", "and", "a", etc.) do not give us any useful information. These are called stop words; we should exclude them when counting word frequency.

We can use the Natural Language Toolkit (nltk) library to remove stop words. 

For now, we'll handle English stop words only.

In [None]:
# Remove stop words
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))  # a set makes the membership test below fast
df_amazon['text_stop'] = df_amazon['text'].str.lower()
df_amazon['text_stop'] = df_amazon['text_stop'].apply(lambda words: ' '.join(word for word in words.split() if word not in stop_words))
df_amazon.head(10).style

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


Unnamed: 0,label,text,count,text_stop
0,__label__2,"Great CD: My lovely Pat has one of the GREAT voices of her generation. I have listened to this CD for YEARS and I still LOVE IT. When I'm in a good mood it makes me feel better. A bad mood just evaporates like sugar in the rain. This CD just oozes LIFE. Vocals are jusat STUUNNING and lyrics just kill. One of life's hidden gems. This is a desert isle CD in my book. Why she never made it big is just beyond me. Everytime I play this, no matter black, white, young, old, male, female EVERYBODY says one thing ""Who was that singing ?""",106,"great cd: lovely pat one great voices generation. listened cd years still love it. i'm good mood makes feel better. bad mood evaporates like sugar rain. cd oozes life. vocals jusat stuunning lyrics kill. one life's hidden gems. desert isle cd book. never made big beyond me. everytime play this, matter black, white, young, old, male, female everybody says one thing ""who singing ?"""
1,__label__2,"One of the best game music soundtracks - for a game I didn't really play: Despite the fact that I have only played a small portion of the game, the music I heard (plus the connection to Chrono Trigger which was great as well) led me to purchase the soundtrack, and it remains one of my favorite albums. There is an incredible mix of fun, epic, and emotional songs. Those sad and beautiful tracks I especially like, as there's not too many of those kinds of songs in my other video game soundtracks. I must admit that one of the songs (Life-A Distant Promise) has brought tears to my eyes on many occasions.My one complaint about this soundtrack is that they use guitar fretting effects in many of the songs, which I find distracting. But even if those weren't included I would still consider the collection worth it.",148,"one best game music soundtracks - game really play: despite fact played small portion game, music heard (plus connection chrono trigger great well) led purchase soundtrack, remains one favorite albums. incredible mix fun, epic, emotional songs. sad beautiful tracks especially like, there's many kinds songs video game soundtracks. must admit one songs (life-a distant promise) brought tears eyes many occasions.my one complaint soundtrack use guitar fretting effects many songs, find distracting. even included would still consider collection worth it."
2,__label__1,"Batteries died within a year ...: I bought this charger in Jul 2003 and it worked OK for a while. The design is nice and convenient. However, after about a year, the batteries would not hold a charge. Might as well just get alkaline disposables, or look elsewhere for a charger that comes with batteries that have better staying power.",60,"batteries died within year ...: bought charger jul 2003 worked ok while. design nice convenient. however, year, batteries would hold charge. might well get alkaline disposables, look elsewhere charger comes batteries better staying power."
3,__label__2,"works fine, but Maha Energy is better: Check out Maha Energy's website. Their Powerex MH-C204F charger works in 100 minutes for rapid charge, with option for slower charge (better for batteries). And they have 2200 mAh batteries.",37,"works fine, maha energy better: check maha energy's website. powerex mh-c204f charger works 100 minutes rapid charge, option slower charge (better batteries). 2200 mah batteries."
4,__label__2,"Great for the non-audiophile: Reviewed quite a bit of the combo players and was hesitant due to unfavorable reviews and size of machines. I am weaning off my VHS collection, but don't want to replace them with DVD's. This unit is well built, easy to setup and resolution and special effects (no progressive scan for HDTV owners) suitable for many people looking for a versatile product.Cons- No universal remote.",69,"great non-audiophile: reviewed quite bit combo players hesitant due unfavorable reviews size machines. weaning vhs collection, want replace dvd's. unit well built, easy setup resolution special effects (no progressive scan hdtv owners) suitable many people looking versatile product.cons- universal remote."
5,__label__1,"DVD Player crapped out after one year: I also began having the incorrect disc problems that I've read about on here. The VCR still works, but hte DVD side is useless. I understand that DVD players sometimes just quit on you, but after not even one year? To me that's a sign on bad quality. I'm giving up JVC after this as well. I'm sticking to Sony or giving another brand a shot.",73,"dvd player crapped one year: also began incorrect disc problems i've read here. vcr still works, hte dvd side useless. understand dvd players sometimes quit you, even one year? that's sign bad quality. i'm giving jvc well. i'm sticking sony giving another brand shot."
6,__label__1,"Incorrect Disc: I love the style of this, but after a couple years, the DVD is giving me problems. It doesn't even work anymore and I use my broken PS2 Now. I wouldn't recommend this, I'm just going to upgrade to a recorder now. I wish it would work but I guess i'm giving up on JVC. I really did like this one... before it stopped working. The dvd player gave me problems probably after a year of having it.",80,"incorrect disc: love style this, couple years, dvd giving problems. even work anymore use broken ps2 now. recommend this, i'm going upgrade recorder now. wish would work guess i'm giving jvc. really like one... stopped working. dvd player gave problems probably year it."
7,__label__1,"DVD menu select problems: I cannot scroll through a DVD menu that is set up vertically. The triangle keys will only select horizontally. So I cannot select anything on most DVD's besides play. No special features, no language select, nothing, just play.",42,"dvd menu select problems: cannot scroll dvd menu set vertically. triangle keys select horizontally. cannot select anything dvd's besides play. special features, language select, nothing, play."
8,__label__2,"Unique Weird Orientalia from the 1930's: Exotic tales of the Orient from the 1930's. ""Dr Shen Fu"", a Weird Tales magazine reprint, is about the elixir of life that grants immortality at a price. If you're tired of modern authors who all sound alike, this is the antidote for you. Owen's palette is loaded with splashes of Chinese and Japanese colours. Marvelous.",62,"unique weird orientalia 1930's: exotic tales orient 1930's. ""dr shen fu"", weird tales magazine reprint, elixir life grants immortality price. tired modern authors sound alike, antidote you. owen's palette loaded splashes chinese japanese colours. marvelous."
9,__label__1,"Not an ""ultimate guide"": Firstly,I enjoyed the format and tone of the book (how the author addressed the reader). However, I did not feel that she imparted any insider secrets that the book promised to reveal. If you are just starting to research law school, and do not know all the requirements of admission, then this book may be a tremendous help. If you have done your homework and are looking for an edge when it comes to admissions, I recommend some more topic-specific books. For example, books on how to write your personal statment, books geared specifically towards LSAT preparation (Powerscore books were the most helpful for me), and there are some websites with great advice geared towards aiding the individuals whom you are asking to write letters of recommendation. Yet, for those new to the entire affair, this book can definitely clarify the requirements for you.",148,"""ultimate guide"": firstly,i enjoyed format tone book (how author addressed reader). however, feel imparted insider secrets book promised reveal. starting research law school, know requirements admission, book may tremendous help. done homework looking edge comes admissions, recommend topic-specific books. example, books write personal statment, books geared specifically towards lsat preparation (powerscore books helpful me), websites great advice geared towards aiding individuals asking write letters recommendation. yet, new entire affair, book definitely clarify requirements you."


Then, let's recount the word frequency.

In [None]:
# Count word frequency
print("Most commonly used words in reviews:")
word_freq = df_amazon['text_stop'].str.lower().str.split().explode().value_counts()
word_freq[:10].to_frame().style

Most commonly used words in reviews:


Unnamed: 0,text_stop
book,4497
one,3293
like,2829
great,2445
good,2410
movie,2267
would,2155
read,1942
get,1795
really,1552


In [None]:
# Count word frequency in positive reviews only
print("Most commonly used words in positive reviews:")
word_freq = df_amazon.loc[df_amazon['label']=='__label__2', 'text_stop'].str.lower().str.split().explode().value_counts() # try changing the label
word_freq[:20].to_frame().style

Most commonly used words in positive reviews:


Unnamed: 0,text_stop
book,2281
great,1899
one,1699
good,1387
like,1335
read,1085
movie,997
love,976
would,845
really,820


### Other preprocessing techniques

Other preprocessing steps we can perform are:
* Removing duplicate data
* Removing punctuation and symbols
* Removing URLs
* Tokenization
* Stemming
* Lemmatization

You may read this [blog](https://www.analyticsvidhya.com/blog/2021/06/text-preprocessing-in-nlp-with-python-codes/) for more info.
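
As a rough illustration of the first few steps in the list (removing duplicates, URLs, and punctuation, then tokenizing), here is a minimal sketch using only the Python standard library. The sample texts and the `preprocess` helper are made up for this example; stemming and lemmatization are left out because they need a library such as nltk.

```python
import re

# Made-up sample texts for illustration only
texts = [
    "Great CD!!! See https://example.com/review for more.",
    "Great CD!!! See https://example.com/review for more.",  # exact duplicate
    "Batteries died within a year...",
]

def preprocess(text):
    """Lowercase, strip URLs and punctuation, then split into tokens."""
    text = text.lower()
    text = re.sub(r'https?://\S+', ' ', text)  # remove URLs
    text = re.sub(r'[^\w\s]', ' ', text)       # replace punctuation/symbols with spaces
    return text.split()                         # simple whitespace tokenization

# Remove exact duplicates while keeping the original order
unique_texts = list(dict.fromkeys(texts))

tokens = [preprocess(t) for t in unique_texts]
print(tokens)
```

Note that the order of the steps matters: the URL is removed before the punctuation, otherwise the URL would be broken into fragments that are harder to detect.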

## B. Structured Data

Structured data is easier to work with since it's already organized. There are four levels of measurement for structured data: nominal and ordinal for qualitative data, and interval and ratio for quantitative data.

We will use the Titanic data set as an example of structured data. This data set contains data from various levels:
* Nominal: `PassengerId`, `Survived`, `Name`, `Sex`, `Ticket`, `Cabin`, `Embarked`
* Ordinal: `Pclass`
* Ratio: `Age`, `SibSp`, `Parch`, `Fare`

Since the data set has no interval-level feature, we will generate a random one, `BodyTemp` (in degrees Celsius).

In structured data, rows represent samples, and columns represent variables/descriptors/features (we'll use the term features when we discuss machine learning).

In [None]:
# Titanic Data
import numpy as np

# Create a random variable:
np.random.seed(132)
df_titanic['BodyTemp'] = np.random.uniform(35.0,38.0,len(df_titanic))

print("Titanic Data:")
df_titanic.head(10).style

Titanic Data:


Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,BodyTemp
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,37.342712
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Thayer)",female,38.0,1,0,PC 17599,71.2833,C85,C,36.143434
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,37.485913
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,37.345048
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,37.404791
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q,36.070705
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S,37.886366
7,8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.075,,S,36.06872
8,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,,S,37.455741
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,,C,37.715564


You can learn more about dataframe rows, columns, indexing, and selection [here](https://pandas.pydata.org/docs/user_guide/indexing.html).

In [None]:
# Select rows 10 through 15 (.loc includes both end labels)
print("Passenger samples:")
df_titanic.loc[10:15]

Passenger samples:


Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,BodyTemp
10,11,1,3,"Sandstrom, Miss. Marguerite Rut",female,4.0,1,1,PP 9549,16.7,G6,S,35.12829
11,12,1,1,"Bonnell, Miss. Elizabeth",female,58.0,0,0,113783,26.55,C103,S,35.048569
12,13,0,3,"Saundercock, Mr. William Henry",male,20.0,0,0,A/5. 2151,8.05,,S,36.668138
13,14,0,3,"Andersson, Mr. Anders Johan",male,39.0,1,5,347082,31.275,,S,35.241584
14,15,0,3,"Vestrom, Miss. Hulda Amanda Adolfina",female,14.0,0,0,350406,7.8542,,S,37.892771
15,16,1,2,"Hewlett, Mrs. (Mary D Kingcome)",female,55.0,0,0,248706,16.0,,S,36.225546


In [None]:
# Select column (variables)
print("Sex and age of passenger samples:")
df_titanic.loc[10:15, ['Sex', 'Age']]

Sex and age of passenger samples:


Unnamed: 0,Sex,Age
10,female,4.0
11,female,58.0
12,male,20.0
13,male,39.0
14,female,14.0
15,female,55.0
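
One detail worth knowing when selecting rows: `.loc` selects by label and includes the end label, while `.iloc` selects by position and excludes the end position. The sketch below shows the difference on a small made-up frame (`toy`), not on the Titanic data.

```python
import pandas as pd

# Toy frame with the default integer index (0, 1, 2, ...), like df_titanic
toy = pd.DataFrame({'Sex': ['male', 'female', 'female', 'male', 'male'],
                    'Age': [22.0, 38.0, 26.0, 35.0, 54.0]})

by_label = toy.loc[1:3, ['Sex', 'Age']]  # labels 1, 2, 3 (end label included)
by_position = toy.iloc[1:3, 0:2]         # positions 1, 2 (end position excluded)

print(len(by_label), 'rows with .loc')
print(len(by_position), 'rows with .iloc')
```

With the default integer index the labels and positions happen to coincide, which makes this off-by-one difference easy to miss.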


## C. Nominal Level (Qualitative)

Nominal data are categories without any inherent order. Since the values are labels rather than quantities, it doesn't make sense to add, subtract, multiply, or divide them (even when they are coded as numbers, like `PassengerId`).

### Mode

Finding the most common value/majority/mode is a good measure of center for nominal data.

Here we can see that the modes are:
* `Survived`: 0, most passengers didn't survive
* `Sex`: male, most passengers were male
* `Embarked`: S, most passengers embarked from Southampton

Note that we don't get the mode for `PassengerId`, `Name`, `Ticket`, and `Cabin` since the values are mostly unique. Although, if you're interested, you could find the most common male and female passenger names.

In [None]:
# Find the mode
print('Mode values:')
df_titanic[['Survived', 'Sex', 'Embarked']].mode()


Mode values:


Unnamed: 0,Survived,Sex,Embarked
0,0,male,S
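
The mode only names the most common value. To see how dominant it is, we can also look at the counts and proportions of every category. This is a small sketch on a made-up Series (`embarked`), not the actual Titanic column.

```python
import pandas as pd

# Made-up embarkation codes for illustration, including one missing value
embarked = pd.Series(['S', 'C', 'S', 'Q', 'S', 'C', None])

counts = embarked.value_counts()                # missing values are skipped by default
shares = embarked.value_counts(normalize=True)  # proportions of the non-missing values

print(counts)
print(shares.round(2))
print('Mode:', embarked.mode()[0])
```

The same two `value_counts` calls can be applied to `df_titanic['Embarked']` or any other nominal column.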


## D. Ordinal Level (Qualitative)

Ordinal data are still qualitative (so no arithmetic!), but we can do more things here than on nominal data.
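
One way to make this distinction explicit in pandas is an ordered categorical, which allows order-based operations but rejects arithmetic. The sketch below uses made-up class values; storing `Pclass` this way is an assumption for illustration, since the data set itself stores it as plain integers.

```python
import pandas as pd

# Made-up passenger classes, stored as an ordered categorical
pclass = pd.Series(pd.Categorical([3, 1, 2, 3, 3], categories=[1, 2, 3], ordered=True))

# Order-based summaries work
print('Lowest:', pclass.min(), 'Highest:', pclass.max())
print('Sorted:', pclass.sort_values().tolist())

# Arithmetic-based summaries are rejected
try:
    pclass.mean()
except TypeError as err:
    print('mean() is not supported:', type(err).__name__)
```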

### Ordering
The first thing we can do is sort the data.

In [None]:
# Sort 'Pclass'
print('Ordered values:')
ordered_pclass = df_titanic['Pclass'].sort_values().to_numpy()
print(ordered_pclass)

Ordered values:
[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3
 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
 3 3 3 3 

### Median

Then we can get the median, which is the middle value of the sorted data.

There are several ways to do this.

In [None]:
# Find the median
print('Median value:')

# numpy way
print('numpy:', np.median(ordered_pclass))

# pandas way
print('pandas:', df_titanic['Pclass'].median())

Median value:
numpy: 3.0
pandas: 3.0


The median of passenger class is 3, meaning at least half of the passengers on the Titanic travelled in third class.

The median is useful for measuring the center since it's quite robust to outliers (extremely high or low values). In the example below, `1000` is clearly an outlier: it would pull the mean far above the typical values, but the median is unaffected.

In [None]:
# Median of data with outliers
example = [1,2,2,3,4,5,1000]
print('Data:', example)
print('Median value:', np.median(example))

Data: [1, 2, 2, 3, 4, 5, 1000]
Median value: 3.0


Note that if there is an even number of samples in the data, the median is the average of the two middle values. This might result in a non-integer value (e.g., 2.5) that does not correspond to any actual category. In such cases, it might be better to use another measure, such as the mode.

In [None]:
# Median of even-numbered data
example = [1,2,2,3,4,5]
print('Data:', example)
print('Median value:', np.median(example))

Data: [1, 2, 2, 3, 4, 5]
Median value: 2.5
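
If we still want a median that is an actual value from the data, Python's built-in `statistics` module offers `median_low` and `median_high`, which pick one of the two middle values instead of averaging them. A minimal sketch on the same example:

```python
import statistics

example = [1, 2, 2, 3, 4, 5]

print('Median:', statistics.median(example))            # average of the two middle values
print('Low median:', statistics.median_low(example))    # lower of the two middle values
print('High median:', statistics.median_high(example))  # higher of the two middle values
```

Which of the two to report is a judgment call; the point is only that both are valid categories, unlike 2.5.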


## E. Interval Level (Quantitative)

Data on the interval level are quantitative, so some arithmetic operations apply, but not all. Since there is no absolute/natural zero, multiplying or dividing this kind of data does not make sense: the resulting ratios are meaningless (20 C is not "twice as hot" as 10 C), and the values themselves can be negative.

However, we can still add, subtract, and measure the average and the variation. 
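
To see why ratios break down at the interval level while differences do not, we can express the same two temperatures in Celsius and Fahrenheit. The values and the helper `to_fahrenheit` are made up for this sketch.

```python
def to_fahrenheit(celsius):
    """Convert a temperature from Celsius to Fahrenheit."""
    return celsius * 9 / 5 + 32

a_c, b_c = 10.0, 20.0                              # two temperatures in Celsius
a_f, b_f = to_fahrenheit(a_c), to_fahrenheit(b_c)  # the same temperatures in Fahrenheit

# The ratio depends on the unit, so it carries no physical meaning
ratio_c = b_c / a_c
ratio_f = b_f / a_f
print('Ratio in Celsius: %.2f, in Fahrenheit: %.2f' % (ratio_c, ratio_f))

# The difference converts consistently (it only scales by 9/5)
diff_c = b_c - a_c
diff_f = b_f - a_f
print('Difference in Celsius: %.1f, in Fahrenheit: %.1f' % (diff_c, diff_f))
```

The ratio changes with the unit because each scale places its zero at an arbitrary point, while the difference only gets rescaled.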

### Adding/subtracting

We created a random variable `BodyTemp` for the imaginary body temperature of the passengers when they were aboard the Titanic.

We can check how many of them had a normal body temperature, which we take to be between 36 and 38 C.


In [None]:
# Check 'BodyTemp' for normal range
print('Passengers with normal body temperature:')
df_titanic['BodyTemp'].between(36,38).value_counts()

Passengers with normal body temperature:


True     607
False    284
Name: BodyTemp, dtype: int64

It looks like about one-third of the passengers had a body temperature outside the normal range.

Now suppose that, over time, the passengers' body temperature dropped by 1 C due to the weather. Let's see how this affects the number of passengers within the normal range.

In [None]:
# Subtract 1 C from each passenger's temperature, then
# check 'BodyTemp' for normal range again
print('Passengers with normal body temperature:')
df_titanic['BodyTemp'].subtract(1).between(36,38).value_counts()

Passengers with normal body temperature:


False    576
True     315
Name: BodyTemp, dtype: int64

The majority of the passengers now fall outside the normal range due to this change in temperature!

### Arithmetic mean

Returning to our original `BodyTemp` data, let's measure the average (mean) body temperature across all passengers.

In [None]:
# Find the mean 'BodyTemp' value
mean_bodytemp = df_titanic['BodyTemp'].mean()
print('Mean body temperature: %.2f' % mean_bodytemp)

Mean body temperature: 36.53


### Standard deviation

On average, the passengers' body temperature is normal. This makes sense because the majority of them are within the normal range.

But let's see how much their body temperatures differ from one another.

In [None]:
# Find the std 'BodyTemp' value
std_bodytemp = df_titanic['BodyTemp'].std()
print('Standard deviation of body temperature: %.2f' % std_bodytemp)

print('\nOn average, passengers had %.2f to %.2f C body temperature.' % (mean_bodytemp-std_bodytemp, mean_bodytemp+std_bodytemp))

Standard deviation of body temperature: 0.87

On average, passengers had 35.67 to 37.40 C body temperature.


Upon closer look, the range of one standard deviation around the mean (35.67 to 37.40 C) extends below 36 C. This measure of variation confirms that some passengers might have been experiencing low body temperature.
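
It can also help to check how much of the data the "mean ± 1 standard deviation" range actually covers. The sketch below does this on freshly generated uniform values (a hypothetical stand-in for `BodyTemp`, using a different random generator and seed than the notebook). For a uniform distribution, the theoretical share is about 58%, compared with about 68% for a normal distribution.

```python
import numpy as np

# Hypothetical stand-in for BodyTemp: uniform values between 35 and 38 C
rng = np.random.default_rng(0)
temps = rng.uniform(35.0, 38.0, 891)

mean = temps.mean()
std = temps.std(ddof=1)  # ddof=1 matches the default of pandas' .std()

# Share of values that fall within one standard deviation of the mean
within_one_std = np.mean((temps >= mean - std) & (temps <= mean + std))
print('Share within mean ± 1 std: %.2f' % within_one_std)
```

So the range printed by the cell above describes where a bit more than half of the passengers fall, not all of them.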

##F. Ratio Level (Quantitative)

The ratio level is the highest level of quantitative data. At this level, we can apply all operations and measures mentioned before.

Since ratio-level data have a natural zero and are non-negative, ratios between values are meaningful, so we can use multiplication and division.
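
To illustrate why the natural zero matters, the sketch below compares a temperature in Celsius (whose zero is arbitrary) with a fare (whose zero means "no money"). All values and the exchange rate are hypothetical and chosen only for illustration.

```python
# Hypothetical temperatures in Celsius
temp_a_c, temp_b_c = 10.0, 20.0

# In Celsius, 20 looks like "twice" 10 ...
ratio_celsius = temp_b_c / temp_a_c

# ... but the ratio changes once we switch to Kelvin, so it carries no real meaning
ratio_kelvin = (temp_b_c + 273.15) / (temp_a_c + 273.15)

# Hypothetical fares and a hypothetical exchange rate
fare_a, fare_b = 10.0, 20.0
rate = 70.0

# For money, the ratio is the same in either currency
ratio_fare = fare_b / fare_a
ratio_fare_converted = (fare_b * rate) / (fare_a * rate)

print('Celsius ratio: %.3f' % ratio_celsius)
print('Kelvin ratio:  %.3f' % ratio_kelvin)
print('Fare ratio:    %.3f (converted: %.3f)' % (ratio_fare, ratio_fare_converted))
```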

### Multiplying/dividing

Suppose we want to find how much the Titanic fare would cost in today's economy. Using a simple inflation calculator, we can estimate how much the fare would change from the Titanic event (British Pound, 1912) to today (British Pound, 2023). Furthermore, we can convert this into our own currency (Philippine Peso, 2023).

Note that these are just simple estimates, and may not be accurate.

Inflation calculator: [link](https://www.bankofengland.co.uk/monetary-policy/inflation/inflation-calculator)

Currency exchange: [link](https://www.xe.com/currencyconverter/convert/?Amount=1&From=GBP&To=PHP)

In [None]:
# Convert Titanic Fare to current currency value (GBP)
# inflation_rate is an estimated price multiplier from 1912 to 2023
inflation_rate = 90.40
df_titanic['Fare_2023'] = df_titanic['Fare'] * inflation_rate

# Convert Titanic Fare to Philippine peso (PHP)
# gbp_php is the estimated exchange rate (PHP per 1 GBP)
gbp_php = 66.75
df_titanic['Fare_2023_php'] = df_titanic['Fare_2023'] * gbp_php

df_titanic[['Fare','Fare_2023', 'Fare_2023_php']]

Unnamed: 0,Fare,Fare_2023,Fare_2023_php
0,7.2500,655.40000,43747.95000
1,71.2833,6444.01032,430137.68886
2,7.9250,716.42000,47821.03500
3,53.1000,4800.24000,320416.02000
4,8.0500,727.72000,48575.31000
...,...,...,...
886,13.0000,1175.20000,78444.60000
887,30.0000,2712.00000,181026.00000
888,23.4500,2119.88000,141501.99000
889,30.0000,2712.00000,181026.00000
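
Because both conversions are plain multiplications, they can be folded into a single factor, and the ratio between any two fares stays the same after conversion. The sketch below is a small illustration using the first three fares from the table above together with the lesson's estimated factors; the variable names other than `inflation_rate` and `gbp_php` are hypothetical.

```python
import numpy as np
import pandas as pd

# First three fares from the table above (GBP, 1912)
fares = pd.Series([7.25, 71.2833, 7.925])

inflation_rate = 90.40  # estimated 1912-to-2023 price multiplier used in the lesson
gbp_php = 66.75         # estimated PHP per 1 GBP used in the lesson

# Two-step conversion, as in the lesson
two_steps = fares * inflation_rate * gbp_php

# Equivalent one-step conversion with a combined factor
combined_factor = inflation_rate * gbp_php
one_step = fares * combined_factor

print(one_step.round(2))
print('Same result:', np.allclose(two_steps, one_step))

# The ratio between two fares is unchanged by the conversion
print('Ratio before: %.4f' % (fares.iloc[1] / fares.iloc[0]))
print('Ratio after:  %.4f' % (one_step.iloc[1] / one_step.iloc[0]))
```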


### Geometric mean

Since ratio-level data are always non-negative, we can use the geometric mean as an alternative measure of center.

* Arithmetic mean: add all samples, then divide by the number of samples
* Geometric mean: multiply all samples, then get the nth root (n=number of samples)
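
The two definitions above can be sketched directly with NumPy on a few hypothetical values. The log-based form is an equivalent way to compute the geometric mean that avoids very large products when there are many samples.

```python
import numpy as np

# Hypothetical positive sample values
samples = np.array([2.0, 8.0, 4.0])
n = len(samples)

# Arithmetic mean: add all samples, then divide by n
amean = samples.sum() / n

# Geometric mean: multiply all samples, then take the nth root
gmean_root = samples.prod() ** (1 / n)

# Equivalent form: exponential of the mean of the logarithms
gmean_log = np.exp(np.log(samples).mean())

print('Arithmetic mean: %.4f' % amean)
print('Geometric mean (nth root): %.4f' % gmean_root)
print('Geometric mean (via logs): %.4f' % gmean_log)
```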

To compute the geometric mean, we can use the [SciPy library](https://scipy.org/). Note that we'll need to drop the zero values, since a single zero would make the whole product (and therefore the geometric mean) zero.

The more the data vary, the greater the difference between the two means.

In [None]:
# Compare the arithmetic and geometric means
from scipy import stats

# Select non-zero Fares
pos_fare = df_titanic.loc[(df_titanic['Fare'] > 0), 'Fare']
# Compute geometric mean
gmean_fare = stats.gmean(pos_fare)

# Compute arithmetic mean (over all fares, including zeros)
amean_fare = df_titanic['Fare'].mean()

print('Geometric mean of Titanic fare: %.2f' % gmean_fare)
print('Arithmetic mean of Titanic fare: %.2f' % amean_fare)

Geometric mean of Titanic fare: 18.98
Arithmetic mean of Titanic fare: 32.20


Whether to use the geometric mean instead of the arithmetic mean depends on the variable. In this case (`Fare`), the arithmetic mean is about 1.7 times the geometric mean. We will need to explore the data further before we decide which is the better measure of center.
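
To see how the spread of the data affects the gap between the two means, here is a minimal sketch comparing two hypothetical data sets, one tightly clustered and one widely spread. It uses `scipy.stats.gmean`, as in the cell above.

```python
import numpy as np
from scipy import stats

# Hypothetical data sets: similar center, very different spread
low_spread = np.array([9.0, 10.0, 11.0])
high_spread = np.array([1.0, 10.0, 100.0])

# Gap between arithmetic and geometric mean for each data set
gap_low = low_spread.mean() - stats.gmean(low_spread)
gap_high = high_spread.mean() - stats.gmean(high_spread)

print('Low spread:  arithmetic %.2f, geometric %.2f, gap %.2f'
      % (low_spread.mean(), stats.gmean(low_spread), gap_low))
print('High spread: arithmetic %.2f, geometric %.2f, gap %.2f'
      % (high_spread.mean(), stats.gmean(high_spread), gap_high))
```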


The geometric mean is mostly applied in financial and time-series analysis. In general, it is useful when the data form a series whose values compound over time. However, it cannot be used on data with negative or zero values.
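
As a small illustration of the compounding case, the sketch below uses hypothetical yearly growth factors (not taken from any data set in this lesson). Compounding the geometric mean over the same number of years reproduces the total growth, while the arithmetic mean overstates it.

```python
import numpy as np
from scipy import stats

# Hypothetical yearly growth factors: +50%, -20%, +10%
growth = np.array([1.5, 0.8, 1.1])
years = len(growth)

# Total growth over the whole period is the product of the yearly factors
total_growth = growth.prod()

# Average yearly growth factor under each kind of mean
g_mean = stats.gmean(growth)
a_mean = growth.mean()

print('Total growth over %d years: %.4f' % (years, total_growth))
print('Geometric mean compounded:  %.4f' % (g_mean ** years))
print('Arithmetic mean compounded: %.4f' % (a_mean ** years))
```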

## Summary

In this lesson, you learned the different types of data and which operations and measures we can apply to each type.

In the next lesson, we will discuss the steps in a data science pipeline, and you will learn different ways to explore our data.

## References
* Ozdemir, *Principles of Data Science*, 2016
* *Python Data Analysis Library*, 2008 ([link](https://pandas.pydata.org/docs/index.html))
* *SciPy Library*, 2001 ([link](https://scipy.org/))
* Deepanshi, *Text Preprocessing in NLP with Python codes*, 2022 ([link](https://www.analyticsvidhya.com/blog/2021/06/text-preprocessing-in-nlp-with-python-codes/))
* *Amazon reviews data set* ([link](https://www.kaggle.com/datasets/bittlingmayer/amazonreviews))
* *Titanic data set* ([link](https://www.kaggle.com/c/titanic))

