<a href="https://colab.research.google.com/github/jbpost2/ST-554-Big-Data-with-Python/blob/main/01_Programming_in_python/17-Numerical_Summaries.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Numerical Summaries

> Justin Post

- The usual first step in an analysis is to get to know your data through an Exploratory Data Analysis (EDA)

- EDA generally consists of a few steps:

    + Understand how your data is stored
    + Do basic data validation
    + Determine rate of missing values
    + Clean up data as needed
    + Investigate distributions
        - Univariate measures/graphs
        - Multivariate measures/graphs
    + Apply transformations and repeat previous step

Note: These types of webpages are built from Jupyter notebooks (`.ipynb` files). You can access your own versions of them by [clicking here](https://colab.research.google.com/github/jbpost2/ST-554-Big-Data-with-Python/blob/main/01_Programming_in_python/17-Numerical_Summaries.ipynb). **It is highly recommended that you go through and run the notebooks yourself, modifying and rerunning things where you'd like!**

---

## Understand How Data is Stored

First, let's read in some data. Recall that for `.csv` files (comma separated value files) we can read them in using the `pandas` `read_csv()` function.

We'll read in the classic titanic data set.

In [1]:
import pandas as pd
titanic_data = pd.read_csv("https://www4.stat.ncsu.edu/~online/datasets/titanic.csv")

- The `.info()` method allows us to see how our variables are stored (among other things)
- Column data types should make sense for what you expect!

In [2]:
titanic_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1310 entries, 0 to 1309
Data columns (total 14 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   pclass     1309 non-null   float64
 1   survived   1309 non-null   float64
 2   name       1309 non-null   object 
 3   sex        1309 non-null   object 
 4   age        1046 non-null   float64
 5   sibsp      1309 non-null   float64
 6   parch      1309 non-null   float64
 7   ticket     1309 non-null   object 
 8   fare       1308 non-null   float64
 9   cabin      295 non-null    object 
 10  embarked   1307 non-null   object 
 11  boat       486 non-null    object 
 12  body       121 non-null    float64
 13  home.dest  745 non-null    object 
dtypes: float64(7), object(7)
memory usage: 143.4+ KB


- `.head()` and `.tail()` help to see what we have as well

In [3]:
titanic_data.head() #clearly some missing values with NaNs

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
0,1.0,1.0,"Allen, Miss. Elisabeth Walton",female,29.0,0.0,0.0,24160,211.3375,B5,S,2.0,,"St Louis, MO"
1,1.0,1.0,"Allison, Master. Hudson Trevor",male,0.9167,1.0,2.0,113781,151.55,C22 C26,S,11.0,,"Montreal, PQ / Chesterville, ON"
2,1.0,0.0,"Allison, Miss. Helen Loraine",female,2.0,1.0,2.0,113781,151.55,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"
3,1.0,0.0,"Allison, Mr. Hudson Joshua Creighton",male,30.0,1.0,2.0,113781,151.55,C22 C26,S,,135.0,"Montreal, PQ / Chesterville, ON"
4,1.0,0.0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0,1.0,2.0,113781,151.55,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"


In [4]:
titanic_data.tail() #note the last row of NaN (not a number)

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
1305,3.0,0.0,"Zabour, Miss. Thamine",female,,1.0,0.0,2665.0,14.4542,,C,,,
1306,3.0,0.0,"Zakarian, Mr. Mapriededer",male,26.5,0.0,0.0,2656.0,7.225,,C,,304.0,
1307,3.0,0.0,"Zakarian, Mr. Ortin",male,27.0,0.0,0.0,2670.0,7.225,,C,,,
1308,3.0,0.0,"Zimmerman, Mr. Leo",male,29.0,0.0,0.0,315082.0,7.875,,S,,,
1309,,,,,,,,,,,,,,


---

## Do Basic Data Validation

- Use the `describe()` method on a data frame
- Check that the minimums, maximums, etc. all make sense!

In [5]:
titanic_data.describe()

Unnamed: 0,pclass,survived,age,sibsp,parch,fare,body
count,1309.0,1309.0,1046.0,1309.0,1309.0,1308.0,121.0
mean,2.294882,0.381971,29.881135,0.498854,0.385027,33.295479,160.809917
std,0.837836,0.486055,14.4135,1.041658,0.86556,51.758668,97.696922
min,1.0,0.0,0.1667,0.0,0.0,0.0,1.0
25%,2.0,0.0,21.0,0.0,0.0,7.8958,72.0
50%,3.0,0.0,28.0,0.0,0.0,14.4542,155.0
75%,3.0,1.0,39.0,1.0,0.0,31.275,256.0
max,3.0,1.0,80.0,8.0,9.0,512.3292,328.0


- Recall we can subset our columns with `[]`

In [6]:
titanic_data.columns

Index(['pclass', 'survived', 'name', 'sex', 'age', 'sibsp', 'parch', 'ticket',
       'fare', 'cabin', 'embarked', 'boat', 'body', 'home.dest'],
      dtype='object')

- We can determine which `percentiles` of selected columns to return by combining the column subsetting via selection brackets `[]` and the `.describe()` method

In [8]:
titanic_data[["age", "sibsp", "parch", "fare"]].describe(percentiles = [0.05, 0.25, 0.99])

Unnamed: 0,age,sibsp,parch,fare
count,1046.0,1309.0,1309.0,1308.0
mean,29.881135,0.498854,0.385027,33.295479
std,14.4135,1.041658,0.86556,51.758668
min,0.1667,0.0,0.0,0.0
5%,5.0,0.0,0.0,7.225
25%,21.0,0.0,0.0,7.8958
50%,28.0,0.0,0.0,14.4542
99%,65.0,5.0,4.0,262.375
max,80.0,8.0,9.0,512.3292
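
The "do the mins and maxes make sense?" check can also be written down as code. Below is a minimal sketch on a small made-up data frame (not the titanic data). The `plausible` ranges are hypothetical values chosen just for illustration, and one bad value is planted on purpose so the check has something to find.

```python
import pandas as pd

# Toy data standing in for a few titanic-style columns (values are made up)
toy = pd.DataFrame({
    "age": [29.0, 0.9167, 2.0, None, 80.0],
    "survived": [1.0, 1.0, 0.0, 0.0, 2.0],   # 2.0 is a deliberately bad value
    "fare": [211.3375, 151.55, 151.55, 7.225, 0.0],
})

# Hypothetical plausible ranges for each column: (lowest, highest)
plausible = {"age": (0, 120), "survived": (0, 1), "fare": (0, 1000)}

# Compare the min and max reported by .describe() to the plausible range
summary = toy.describe()
for col, (low, high) in plausible.items():
    ok = (summary.loc["min", col] >= low) and (summary.loc["max", col] <= high)
    print(col, "ok" if ok else "check this column!")

# Pull out the rows that fall outside the range for one column
bad_rows = toy[~toy["survived"].between(0, 1)]
print(bad_rows)
```

Here only `survived` gets flagged, and `bad_rows` holds the single offending row so you can inspect it.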


---

## Determine Rate of Missing Values

- Use the `.isnull()` method to determine the missing values

In [9]:
titanic_data.isnull()

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
0,False,False,False,False,False,False,False,False,False,False,False,False,True,False
1,False,False,False,False,False,False,False,False,False,False,False,False,True,False
2,False,False,False,False,False,False,False,False,False,False,False,True,True,False
3,False,False,False,False,False,False,False,False,False,False,False,True,False,False
4,False,False,False,False,False,False,False,False,False,False,False,True,True,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1305,False,False,False,False,True,False,False,False,False,True,False,True,True,True
1306,False,False,False,False,False,False,False,False,False,True,False,True,False,True
1307,False,False,False,False,False,False,False,False,False,True,False,True,True,True
1308,False,False,False,False,False,False,False,False,False,True,False,True,True,True


- Yikes! Can't make heads or tails of that.
- This is a `DataFrame` of booleans!
- Use the `.sum()` method to see how many `null` values we have for each column

In [10]:
titanic_data.isnull().sum()

Unnamed: 0,0
pclass,1
survived,1
name,1
sex,1
age,264
sibsp,1
parch,1
ticket,1
fare,2
cabin,1015


- Chaining multiple methods like this gives us a good chance to use the `\` line-continuation character to create more readable, multi-line code

In [16]:
titanic_data.isnull() \
  .sum()

Unnamed: 0,0
pclass,1
survived,1
name,1
sex,1
age,264
sibsp,1
parch,1
ticket,1
fare,2
cabin,1015
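
The counts above can be turned into a *rate* of missingness. A minimal sketch on a made-up data frame (the column values are invented for illustration): since `True`/`False` values average to the proportion of `True`, the `.mean()` method on the boolean data frame gives the rate directly.

```python
import pandas as pd
import numpy as np

# Toy data with some missing values (made up for illustration)
toy = pd.DataFrame({
    "age": [29.0, np.nan, 2.0, np.nan],
    "fare": [211.34, 151.55, np.nan, 7.23],
    "sex": ["female", "male", "female", "male"],
})

missing_count = toy.isnull().sum()
missing_rate = toy.isnull().mean()   # mean of True/False = proportion of True

# Put both in one table, sorted so the worst columns come first
missing_table = pd.DataFrame({"count": missing_count, "rate": missing_rate}) \
  .sort_values("rate", ascending = False)
print(missing_table)
```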


---

## Clean Up Data As Needed

- We can remove rows with missing values using the `.dropna()` method
- First, remove the `cabin`, `boat`, and `body` variables since they have so many missing values
  + If we want to just remove some columns, we can use the `.drop()` method

In [14]:
sub_titanic_data = titanic_data.drop(columns = ["body", "cabin", "boat"])
sub_titanic_data.shape

(1310, 11)

- Check on the missingness now

In [15]:
sub_titanic_data.isnull().sum()

Unnamed: 0,0
pclass,1
survived,1
name,1
sex,1
age,264
sibsp,1
parch,1
ticket,1
fare,2
embarked,3


- Now we are ready to use the `.dropna()` method to remove any rows with missing data

In [18]:
temp = sub_titanic_data.dropna()
temp.shape #notice the reduction in rows

(684, 11)

In [19]:
temp.isnull().sum() #no more missing values

Unnamed: 0,0
pclass,0
survived,0
name,0
sex,0
age,0
sibsp,0
parch,0
ticket,0
fare,0
embarked,0
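
A middle ground between keeping everything and dropping every incomplete row is to only require the columns you actually need. A minimal sketch using the `subset` argument of `.dropna()` on made-up data (which columns count as "needed" is an assumption for the example):

```python
import pandas as pd
import numpy as np

# Toy data (made up): home.dest is mostly missing, age and fare mostly present
toy = pd.DataFrame({
    "age": [29.0, np.nan, 2.0, 30.0],
    "fare": [211.34, 151.55, np.nan, 7.23],
    "home.dest": ["St Louis, MO", None, None, None],
})

# Drop a row if it has a missing value in any column
all_complete = toy.dropna()

# Drop a row only if age or fare is missing, ignoring home.dest
needed_complete = toy.dropna(subset = ["age", "fare"])

print(all_complete.shape, needed_complete.shape)
```

The second version keeps one more row here because a missing `home.dest` no longer disqualifies it.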


- Usually, you don't want to drop all the rows with any missing data as you are throwing out useful info.
- One option is to impute the missing values... this can be dangerous but can be done with the `.fillna()` method

In [20]:
sub_titanic_data.fillna(value = 0) #note, for instance, some values of age are 0 now and the last row is all 0 values

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,embarked,home.dest
0,1.0,1.0,"Allen, Miss. Elisabeth Walton",female,29.0000,0.0,0.0,24160,211.3375,S,"St Louis, MO"
1,1.0,1.0,"Allison, Master. Hudson Trevor",male,0.9167,1.0,2.0,113781,151.5500,S,"Montreal, PQ / Chesterville, ON"
2,1.0,0.0,"Allison, Miss. Helen Loraine",female,2.0000,1.0,2.0,113781,151.5500,S,"Montreal, PQ / Chesterville, ON"
3,1.0,0.0,"Allison, Mr. Hudson Joshua Creighton",male,30.0000,1.0,2.0,113781,151.5500,S,"Montreal, PQ / Chesterville, ON"
4,1.0,0.0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0000,1.0,2.0,113781,151.5500,S,"Montreal, PQ / Chesterville, ON"
...,...,...,...,...,...,...,...,...,...,...,...
1305,3.0,0.0,"Zabour, Miss. Thamine",female,0.0000,1.0,0.0,2665,14.4542,C,0
1306,3.0,0.0,"Zakarian, Mr. Mapriededer",male,26.5000,0.0,0.0,2656,7.2250,C,0
1307,3.0,0.0,"Zakarian, Mr. Ortin",male,27.0000,0.0,0.0,2670,7.2250,C,0
1308,3.0,0.0,"Zimmerman, Mr. Leo",male,29.0000,0.0,0.0,315082,7.8750,S,0


- We can set the value to impute for each column by passing a dictionary of column name/value pairs

In [21]:
sub_titanic_data.fillna(value = {"home.dest": "Unknown", "age": 200})

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,embarked,home.dest
0,1.0,1.0,"Allen, Miss. Elisabeth Walton",female,29.0000,0.0,0.0,24160,211.3375,S,"St Louis, MO"
1,1.0,1.0,"Allison, Master. Hudson Trevor",male,0.9167,1.0,2.0,113781,151.5500,S,"Montreal, PQ / Chesterville, ON"
2,1.0,0.0,"Allison, Miss. Helen Loraine",female,2.0000,1.0,2.0,113781,151.5500,S,"Montreal, PQ / Chesterville, ON"
3,1.0,0.0,"Allison, Mr. Hudson Joshua Creighton",male,30.0000,1.0,2.0,113781,151.5500,S,"Montreal, PQ / Chesterville, ON"
4,1.0,0.0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0000,1.0,2.0,113781,151.5500,S,"Montreal, PQ / Chesterville, ON"
...,...,...,...,...,...,...,...,...,...,...,...
1305,3.0,0.0,"Zabour, Miss. Thamine",female,200.0000,1.0,0.0,2665,14.4542,C,Unknown
1306,3.0,0.0,"Zakarian, Mr. Mapriededer",male,26.5000,0.0,0.0,2656,7.2250,C,Unknown
1307,3.0,0.0,"Zakarian, Mr. Ortin",male,27.0000,0.0,0.0,2670,7.2250,C,Unknown
1308,3.0,0.0,"Zimmerman, Mr. Leo",male,29.0000,0.0,0.0,315082,7.8750,S,Unknown
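
The value of 200 above makes the imputed rows easy to spot. In practice a summary of the observed values is a common choice instead. Here is a minimal sketch (on made-up data, and only one of many possible imputation choices) that fills `age` with the column's median:

```python
import pandas as pd
import numpy as np

# Toy data (made up for illustration)
toy = pd.DataFrame({
    "age": [20.0, np.nan, 30.0, 40.0],
    "home.dest": ["St Louis, MO", None, "Montreal, PQ", None],
})

age_median = toy["age"].median()   # missing values are skipped by default

# Same dictionary idea as above, with the median as the fill value for age
filled = toy.fillna(value = {"home.dest": "Unknown", "age": age_median})
print(filled)
```

Keep in mind this still makes up data: every imputed passenger gets the same age, which shrinks the spread of the variable.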


---

## Investigate distributions  

- How to summarize data depends on the type of data  

  + Categorical (Qualitative) variable - entries are a label or attribute   
  + Numeric (Quantitative) variable - entries are a numerical value where math can be performed

- Numerical summaries (across subgroups)  

    + Contingency Tables (for categorical data)
    + Mean/Median  
    + Standard Deviation/Variance/IQR
    + Quantiles/Percentiles

- Graphical summaries (across subgroups)  

    + Bar plots (for categorical data)
    + Histograms  
    + Box plots  
    + Scatter plots

---

### Categorical Data

Goal: Describe the **distribution** of the variable  

- Distribution = pattern and frequency with which you observe a variable  
- Categorical variable - entries are a label or attribute   
  + Describe the relative frequency (or count) for each category
  + Using `pandas` `.value_counts()` method and `crosstab()` function

Variables of interest for this section:
  + embarked (where journey started)  


In [22]:
sub_titanic_data.embarked[0:2]

Unnamed: 0,embarked
0,S
1,S


In [23]:
type(sub_titanic_data.embarked[0])

str

The `str` type isn't ideal for summarizations. A different data type is better!

#### Category Type Variables

A category type variable is really useful for categorical variables.

- Akin to a factor variable in `R` (if you know those)
- Can have more descriptive labels, ordering of categories, etc.

Let's give the `embarked` variable more descriptive values by converting it to a `category` type and manipulating it that way.

In [24]:
sub_titanic_data["embarkedC"] = sub_titanic_data.embarked.astype("category")
sub_titanic_data.embarkedC[0:2]

Unnamed: 0,embarkedC
0,S
1,S


- Now we can use the `.cat.rename_categories()` method on this `category` variable

In [25]:
sub_titanic_data.embarkedC = sub_titanic_data.embarkedC.cat.rename_categories(["Cherbourg", "Queenstown", "Southampton"])
sub_titanic_data.embarkedC[0:2]

Unnamed: 0,embarkedC
0,Southampton
1,Southampton


Way better! Now let's grab two more categorical variables and do similar things:

+ sex (Male or Female)  
+ survived (survived or died)  

In [26]:
#convert sex variable
sub_titanic_data["sexC"] = sub_titanic_data.sex.astype("category")
sub_titanic_data.sexC = sub_titanic_data.sexC.cat.rename_categories(["Female", "Male"])
#convert survived variable
sub_titanic_data["survivedC"] = sub_titanic_data.survived.astype("category")
sub_titanic_data.survivedC = sub_titanic_data.survivedC.cat.rename_categories(["Died", "Survived"])
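
Passing a list to `.cat.rename_categories()` relies on knowing the order of the existing categories. A minimal sketch on a made-up `Series` showing how to check that order with `.cat.categories`, and how a dictionary can be passed instead so each old label is matched to its new label by name (the `port_names` dictionary is just a name chosen for this example):

```python
import pandas as pd

# Toy version of the embarked column (made up, with one missing value)
toy = pd.Series(["S", "C", "S", "Q", None], name = "embarked")
toy_cat = toy.astype("category")

# The categories are stored in sorted order, not order of appearance
print(toy_cat.cat.categories)

# A dictionary maps old label -> new label, so the order doesn't matter
port_names = {"S": "Southampton", "C": "Cherbourg", "Q": "Queenstown"}
toy_named = toy_cat.cat.rename_categories(port_names)
print(toy_named)
```

The missing value stays missing after the rename, since it isn't one of the categories.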

---

#### Contingency tables

- Tables of counts are the main numerical summary for categorical data
- Create **one-way contingency tables** (`.value_counts()` method) (one-way because we are looking at one variable at a time)


In [27]:
sub_titanic_data.embarkedC.value_counts(dropna = False)

Unnamed: 0_level_0,count
embarkedC,Unnamed: 1_level_1
Southampton,914
Cherbourg,270
Queenstown,123
,3


In [28]:
sub_titanic_data.survivedC.value_counts()

Unnamed: 0_level_0,count
survivedC,Unnamed: 1_level_1
Died,809
Survived,500


In [29]:
sub_titanic_data.sexC.value_counts()

Unnamed: 0_level_0,count
sexC,Unnamed: 1_level_1
Male,843
Female,466


- Alternatively, we can find a one-way table using the `pd.crosstab()` function
  - This function is meant to take two columns (or more) and return tabulations between those two variables
  - We can define a dummy variable to cross with
  - `index` argument is the row variable and `columns` argument is the column variable

In [30]:
sub_titanic_data["dummy"] = 0
pd.crosstab(index = sub_titanic_data.embarkedC, columns = sub_titanic_data.dummy)

dummy,0
embarkedC,Unnamed: 1_level_1
Cherbourg,270
Queenstown,123
Southampton,914


In [31]:
pd.crosstab(index = sub_titanic_data.sexC, columns = sub_titanic_data.dummy)

dummy,0
sexC,Unnamed: 1_level_1
Female,466
Male,843


- To summarize two categorical variables together, we use a two-way contingency table
- Now the `crosstab()` function can be used more naturally

In [32]:
pd.crosstab(
  sub_titanic_data.embarkedC, #index variable
  sub_titanic_data.survivedC) #column variable

survivedC,Died,Survived
embarkedC,Unnamed: 1_level_1,Unnamed: 2_level_1
Cherbourg,120,150
Queenstown,79,44
Southampton,610,304


In [33]:
pd.crosstab(
  sub_titanic_data.sexC,
  sub_titanic_data.survivedC)

survivedC,Died,Survived
sexC,Unnamed: 1_level_1,Unnamed: 2_level_1
Female,127,339
Male,682,161


- Add marginal totals with `margins = True` argument

In [34]:
pd.crosstab(
  sub_titanic_data.embarkedC,
  sub_titanic_data.survivedC,
  margins = True)

survivedC,Died,Survived,All
embarkedC,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Cherbourg,120,150,270
Queenstown,79,44,123
Southampton,610,304,914
All,809,498,1307


- Add row and column names for clarity
  + Use `rownames` and `colnames` arguments


In [35]:
pd.crosstab(
  sub_titanic_data.embarkedC,
  sub_titanic_data.survivedC,
  margins = True,
  rownames = ["Embarked Port"],
  colnames = ["Survival Status"]
  )

Survival Status,Died,Survived,All
Embarked Port,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Cherbourg,120,150,270
Queenstown,79,44,123
Southampton,610,304,914
All,809,498,1307


That looks great!
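
Counts can be hard to compare when the groups have very different sizes. A minimal sketch on made-up data (not the titanic counts) using the `normalize` argument of `pd.crosstab()` to get proportions within each row:

```python
import pandas as pd

# Toy data (made up): 4 passengers from port C, 6 from port S
toy = pd.DataFrame({
    "embarked": ["C", "C", "C", "C", "S", "S", "S", "S", "S", "S"],
    "survived": ["Survived", "Survived", "Survived", "Died",
                 "Died", "Died", "Died", "Died", "Survived", "Survived"],
})

# normalize = "index" divides each row by its row total
row_props = pd.crosstab(toy.embarked, toy.survived, normalize = "index")
print(row_props)
```

Each row now sums to 1, so the `Survived` column can be read as the survival proportion for that port. Using `normalize = "columns"` or `normalize = "all"` divides by the column totals or the overall total instead.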

For more than two variables we can create tables but they get harder to read. For instance, we can look at a three-way contingency table:

In [36]:
pd.crosstab(
  [sub_titanic_data.embarkedC, sub_titanic_data.survivedC], #pass a list of columns for the rows
  sub_titanic_data.sexC,
  margins = True)

Unnamed: 0_level_0,sexC,Female,Male,All
embarkedC,survivedC,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Cherbourg,Died,11,109,120
Cherbourg,Survived,102,48,150
Queenstown,Died,23,56,79
Queenstown,Survived,37,7,44
Southampton,Died,93,517,610
Southampton,Survived,198,106,304
All,,464,843,1307


- We can add in names for more clarity

In [38]:
my_tab = pd.crosstab(
  [sub_titanic_data.embarkedC, sub_titanic_data.survivedC],
  sub_titanic_data.sexC,
  margins = True,
  rownames = ['Embarked Port', 'Survival Status'], #a list similar to how the rows were passed
  colnames = ['Sex'])
my_tab

Unnamed: 0_level_0,Sex,Female,Male,All
Embarked Port,Survival Status,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Cherbourg,Died,11,109,120
Cherbourg,Survived,102,48,150
Queenstown,Died,23,56,79
Queenstown,Survived,37,7,44
Southampton,Died,93,517,610
Southampton,Survived,198,106,304
All,,464,843,1307


We might want to subset the returned table to get certain values...

- Note that the `crosstab()` function returns a data frame!

In [39]:
type(my_tab)

In [40]:
my_tab.columns # columns of the data frame

Index(['Female', 'Male', 'All'], dtype='object', name='Sex')

In [41]:
my_tab.index #rows of the data frame, these are tuples!

MultiIndex([(  'Cherbourg',     'Died'),
            (  'Cherbourg', 'Survived'),
            ( 'Queenstown',     'Died'),
            ( 'Queenstown', 'Survived'),
            ('Southampton',     'Died'),
            ('Southampton', 'Survived'),
            (        'All',         '')],
           names=['Embarked Port', 'Survival Status'])

- Can obtain **conditional** bivariate info via subsetting!
- The `MultiIndex` can be tough but let's look at some examples

- Below returns the embarked vs survived table for females

In [42]:
my_tab["Female"]

Unnamed: 0_level_0,Unnamed: 1_level_0,Female
Embarked Port,Survival Status,Unnamed: 2_level_1
Cherbourg,Died,11
Cherbourg,Survived,102
Queenstown,Died,23
Queenstown,Survived,37
Southampton,Died,93
Southampton,Survived,198
All,,464


In [43]:
my_tab.loc[:, "Female"] #.loc way of doing this, : gives all of that index

Unnamed: 0_level_0,Unnamed: 1_level_0,Female
Embarked Port,Survival Status,Unnamed: 2_level_1
Cherbourg,Died,11
Cherbourg,Survived,102
Queenstown,Died,23
Queenstown,Survived,37
Southampton,Died,93
Southampton,Survived,198
All,,464


- Below returns the sex vs embarked table for those that died

In [44]:
my_tab.iloc[0:5:2, :] #0:5:2 is a slice: start at 0, stop before 5, in steps of 2

Unnamed: 0_level_0,Sex,Female,Male,All
Embarked Port,Survival Status,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Cherbourg,Died,11,109,120
Queenstown,Died,23,56,79
Southampton,Died,93,517,610


- Using `.loc[]` is better
- Must understand our `MultiIndex`

In [45]:
my_tab.index

MultiIndex([(  'Cherbourg',     'Died'),
            (  'Cherbourg', 'Survived'),
            ( 'Queenstown',     'Died'),
            ( 'Queenstown', 'Survived'),
            ('Southampton',     'Died'),
            ('Southampton', 'Survived'),
            (        'All',         '')],
           names=['Embarked Port', 'Survival Status'])

- Below uses this index to return the sex vs embarked table for those that died

In [46]:
my_tab.loc[(("Cherbourg", "Queenstown", "Southampton"), "Died"), :]

Unnamed: 0_level_0,Sex,Female,Male,All
Embarked Port,Survival Status,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Cherbourg,Died,11,109,120
Queenstown,Died,23,56,79
Southampton,Died,93,517,610


- Below returns the sex vs survived table for embarked of Cherbourg

In [47]:
my_tab.loc[('Cherbourg', ("Died", "Survived")), :]

Unnamed: 0_level_0,Sex,Female,Male,All
Embarked Port,Survival Status,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Cherbourg,Died,11,109,120
Cherbourg,Survived,102,48,150


- Return the sex table for those that died and embarked at Cherbourg
  + First with `.iloc[]` then with `.loc[]`

In [48]:
my_tab.iloc[0, :]

Unnamed: 0_level_0,Cherbourg
Unnamed: 0_level_1,Died
Sex,Unnamed: 1_level_2
Female,11
Male,109
All,120


In [49]:
my_tab.loc[('Cherbourg', 'Died')]

Unnamed: 0_level_0,Cherbourg
Unnamed: 0_level_1,Died
Sex,Unnamed: 1_level_2
Female,11
Male,109
All,120
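
Another way to pull one level out of a `MultiIndex` is the `.xs()` (cross-section) method, which isn't used above. A minimal sketch on a small made-up table built the same way as `my_tab` (the data and the name `toy_tab` are just for illustration, and margins are left off to keep it short):

```python
import pandas as pd

# Toy data (made up for illustration)
toy = pd.DataFrame({
    "port": ["Cherbourg", "Cherbourg", "Cherbourg",
             "Southampton", "Southampton", "Southampton"],
    "status": ["Died", "Survived", "Survived", "Died", "Died", "Survived"],
    "sex": ["Male", "Female", "Female", "Male", "Female", "Male"],
})

toy_tab = pd.crosstab(
  [toy.port, toy.status],
  toy.sex,
  rownames = ["Embarked Port", "Survival Status"],
  colnames = ["Sex"])

# Sex vs embarked table for those that died
died = toy_tab.xs("Died", level = "Survival Status")

# Sex vs survived table for those that embarked at Cherbourg
cherbourg = toy_tab.xs("Cherbourg", level = "Embarked Port")

print(died)
print(cherbourg)
```

By default `.xs()` drops the level you selected on, so the result has a regular (single level) index.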


---

### Numeric Data

Goal: Describe the **distribution** of the variable  

- Distribution = pattern and frequency with which you observe a variable  
- Numeric variable - entries are a numerical value where math can be performed

For a single numeric variable, describe the distribution via

+ Shape: Histogram, Density plot, ... (covered later)
+ **Measures of center: Mean, Median, ...**
+ **Measures of spread: Variance, Standard Deviation, Quartiles, IQR, ...**

For two numeric variables, describe the distribution via

+ Shape: Scatter plot, ...
+ **Measures of linear relationship: Covariance, Correlation, ...**

---

#### Measures of Center

- Find mean and median with methods on a `Series`

In [50]:
type(sub_titanic_data['fare'])

- Corresponding methods exist for the common numerical summaries

In [52]:
sub_titanic_data['fare'].mean()

33.29547928134557

In [53]:
sub_titanic_data['fare'].median()

14.4542

In [54]:
sub_titanic_data.age.mean() #same thing with a different way to get a column

29.8811345124283

In [55]:
sub_titanic_data.age.median()

28.0

---

#### Measures of Spread

- Standard Deviation, Quartiles, & IQR found with `Series` methods as well

In [56]:
sub_titanic_data.age.std()

14.413499699923594

In [57]:
sub_titanic_data.age.quantile(q = [0.2, 0.25, 0.5, 0.95])

Unnamed: 0,age
0.2,19.0
0.25,21.0
0.5,28.0
0.95,57.0


In [58]:
q1 = sub_titanic_data.age.quantile(q = [0.25])
q1

Unnamed: 0,age
0.25,21.0


In [59]:
q3 = sub_titanic_data.age.quantile(q = [0.75])
q3

Unnamed: 0,age
0.75,39.0


In [60]:
type(q1)

- As both `q1` and `q3` are `Series`, they have indices
- This makes them a little more difficult than you might like to subtract (to find the IQR)

In [61]:
q3-q1 #doesn't work due to the differing index names

Unnamed: 0,age
0.25,
0.75,


In [62]:
q3[0.75] - q1[0.25] #grab the values by index names and subtract those

18.0

- Alternatively, remember that returning the `.values` attribute returns a `numpy` array. We can subtract these.

In [63]:
q3.values - q1.values

array([18.])
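
The index issue above comes from passing a *list* to `.quantile()`. A minimal sketch on a made-up `Series` of ages: passing a single number returns a plain number with no index, so the subtraction works directly. Subtracting two `Series` lines values up by index label, which is why labels of 0.25 and 0.75 never match.

```python
import pandas as pd

# Toy ages (made up for illustration)
ages = pd.Series([19.0, 21.0, 28.0, 39.0, 57.0], name = "age")

# A single number (not a list) returns a single value, not a Series
q1 = ages.quantile(0.25)
q3 = ages.quantile(0.75)
iqr = q3 - q1
print(iqr)

# Or ask for both at once and subtract by index label
both = ages.quantile([0.25, 0.75])
iqr_from_series = both.loc[0.75] - both.loc[0.25]
print(iqr_from_series)
```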

---

#### Measures of Linear Relationship

- Correlation via the `.corr()` method on a data frame
- This gives the pairwise correlations between the numerically stored variables that are passed
  + Just because it is stored numerically doesn't mean we should treat it numerically!

In [68]:
sub_titanic_data[["age", "fare", "sibsp", "parch"]].corr()

Unnamed: 0,age,fare,sibsp,parch
age,1.0,0.178739,-0.243699,-0.150917
fare,0.178739,1.0,0.160238,0.221539
sibsp,-0.243699,0.160238,1.0,0.373587
parch,-0.150917,0.221539,0.373587,1.0
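
Covariance was listed as a measure of linear relationship too. A minimal sketch on made-up data showing the `.cov()` method on a `Series` and how the (Pearson) correlation is just the covariance divided by the product of the two standard deviations:

```python
import pandas as pd

# Toy data (made up for illustration)
toy = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 5.0],
    "y": [2.0, 1.0, 4.0, 3.0, 5.0],
})

# Sample covariance between the two columns
cov_xy = toy["x"].cov(toy["y"])

# Correlation = covariance / (sd of x * sd of y)
r_manual = cov_xy / (toy["x"].std() * toy["y"].std())

# Same thing via the built-in method
r_pandas = toy["x"].corr(toy["y"])

print(cov_xy, r_manual, r_pandas)
```

The `.cov()` method also exists on a data frame and returns a full covariance matrix, similar to `.corr()` above.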


---

### Summaries Across Groups

Usually want summaries for different **subgroups of data**

Two approaches we'll cover:
- Use `.groupby()` method and then use a summarization method
- Use `pd.crosstab()` function with `aggfunc` argument

#### `.groupby()` Examples

Example: Get similar summaries of the numeric variables for each *survival status*

In [69]:
sub_titanic_data.groupby("survivedC")[["age", "fare", "sibsp", "parch"]].mean()

Unnamed: 0_level_0,age,fare,sibsp,parch
survivedC,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Died,30.545369,23.353831,0.521632,0.328801
Survived,28.918228,49.361184,0.462,0.476


In [70]:
sub_titanic_data.groupby("survivedC")[["age", "fare", "sibsp", "parch"]].std()

Unnamed: 0_level_0,age,fare,sibsp,parch
survivedC,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Died,13.922539,34.145096,1.210449,0.912332
Survived,15.061481,68.648795,0.685197,0.776292


- `.unstack()` method on the result can sometimes make the output clearer

In [71]:
sub_titanic_data.groupby("survivedC")[["age", "fare", "sibsp", "parch"]].mean().unstack()

Unnamed: 0_level_0,Unnamed: 1_level_0,0
Unnamed: 0_level_1,survivedC,Unnamed: 2_level_1
age,Died,30.545369
age,Survived,28.918228
fare,Died,23.353831
fare,Survived,49.361184
sibsp,Died,0.521632
sibsp,Survived,0.462
parch,Died,0.328801
parch,Survived,0.476


- Multiple grouping variables can be given as a list

  - Example: Get summary for numeric type variables for each *survival status* and *embarked port*

In [72]:
sub_titanic_data.groupby(["survivedC", "embarkedC"])[["age", "fare", "sibsp", "parch"]].mean()

Unnamed: 0_level_0,Unnamed: 1_level_0,age,fare,sibsp,parch
survivedC,embarkedC,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Died,Cherbourg,34.46875,40.255592,0.316667,0.225
Died,Queenstown,30.202703,11.615349,0.379747,0.177215
Died,Southampton,29.945385,21.54616,0.580328,0.368852
Survived,Cherbourg,31.037248,80.000807,0.466667,0.486667
Survived,Queenstown,24.153846,13.833998,0.272727,0.0
Survived,Southampton,27.989881,39.18347,0.490132,0.542763


In [73]:
sub_titanic_data.groupby(["survivedC", "embarkedC"])[["age", "fare", "sibsp", "parch"]].std()

Unnamed: 0_level_0,Unnamed: 1_level_0,age,fare,sibsp,parch
survivedC,embarkedC,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Died,Cherbourg,14.655181,56.553704,0.518293,0.55704
Died,Queenstown,16.785187,10.92224,1.016578,0.655538
Died,Southampton,13.496871,28.78602,1.320897,0.990934
Survived,Cherbourg,15.523752,97.642219,0.57541,0.730327
Survived,Queenstown,7.057457,17.50385,0.58523,0.0
Survived,Southampton,14.926867,47.656409,0.744552,0.831405


- As our code gets longer, this is a good place to use `\` to extend our code down a line

In [74]:
sub_titanic_data \
  .groupby(["survivedC", "embarkedC"]) \
   [["age", "fare", "sibsp", "parch"]] \
   .mean()

Unnamed: 0_level_0,Unnamed: 1_level_0,age,fare,sibsp,parch
survivedC,embarkedC,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Died,Cherbourg,34.46875,40.255592,0.316667,0.225
Died,Queenstown,30.202703,11.615349,0.379747,0.177215
Died,Southampton,29.945385,21.54616,0.580328,0.368852
Survived,Cherbourg,31.037248,80.000807,0.466667,0.486667
Survived,Queenstown,24.153846,13.833998,0.272727,0.0
Survived,Southampton,27.989881,39.18347,0.490132,0.542763
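
Rather than calling `.mean()` and `.std()` separately, several summaries can be requested at once with the `.agg()` method. A minimal sketch on made-up data with two choices that differ from the code above: the chain is wrapped in parentheses instead of using `\`, and `observed` is set explicitly in `.groupby()` (as far as we know, its default for `category` grouping variables has differed between `pandas` versions, so being explicit avoids surprises).

```python
import pandas as pd

# Toy data (made up), with a category type grouping variable
toy = pd.DataFrame({
    "survivedC": pd.Categorical(["Died", "Died", "Survived", "Survived", "Died"]),
    "fare": [10.0, 20.0, 50.0, 70.0, 30.0],
})

# Parentheses let the chain span multiple lines without needing \
fare_summary = (
    toy
    .groupby("survivedC", observed = True)["fare"]
    .agg(["mean", "median", "count"])
)
print(fare_summary)
```

Each name in the list passed to `.agg()` becomes a column of the result, with one row per group.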


#### `pd.crosstab()` Examples

- Alternatively, we can use the `pd.crosstab()` function with the `aggfunc` argument to define the summarization to produce

Example: Get summary for numeric type variables for each *survival status*

- A bit awkward in this case as we don't really have a 'column' variable
- Make a dummy variable for that

In [76]:
pd.crosstab(
  sub_titanic_data.survivedC,
  columns = ["mean" for _ in range(sub_titanic_data.shape[0])], #create variable with only the value 'mean'
  values = sub_titanic_data.fare,
  aggfunc = 'mean')

col_0,mean
survivedC,Unnamed: 1_level_1
Died,23.353831
Survived,49.361184


- Can return multiple summaries at once by passing them as a `list`

In [77]:
pd.crosstab(
  sub_titanic_data.survivedC,
  columns = ["stat" for _ in range(sub_titanic_data.shape[0])],
  values = sub_titanic_data.fare,
  aggfunc = ['mean', 'median', 'std', 'count'])

Unnamed: 0_level_0,mean,median,std,count
col_0,stat,stat,stat,stat
survivedC,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
Died,23.353831,10.5,34.145096,808
Survived,49.361184,26.0,68.648795,500


- More natural with two grouping variables

  - Example: Get summary for numeric type variables for each *survival status* and *embarked port*

In [78]:
pd.crosstab(
  sub_titanic_data.embarkedC,
  sub_titanic_data.survivedC,
  values = sub_titanic_data.fare,
  aggfunc = ['mean', 'count'])

Unnamed: 0_level_0,mean,mean,count,count
survivedC,Died,Survived,Died,Survived
embarkedC,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
Cherbourg,40.255592,80.000807,120,150
Queenstown,11.615349,13.833998,79,44
Southampton,21.54616,39.18347,609,304
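
The same kind of table can be built with the `.pivot_table()` method, where columns are referred to by name rather than passed as `Series`. A minimal sketch on made-up data comparing the two approaches (the data frame and its values are invented for illustration):

```python
import pandas as pd

# Toy data (made up for illustration)
toy = pd.DataFrame({
    "embarkedC": ["Cherbourg", "Cherbourg", "Southampton", "Southampton", "Southampton"],
    "survivedC": ["Died", "Survived", "Died", "Died", "Survived"],
    "fare": [40.0, 80.0, 20.0, 30.0, 50.0],
})

# Mean fare by port and survival status using crosstab
via_crosstab = pd.crosstab(
  toy.embarkedC,
  toy.survivedC,
  values = toy.fare,
  aggfunc = "mean")

# Same summary using pivot_table with column names
via_pivot = toy.pivot_table(
  values = "fare",
  index = "embarkedC",
  columns = "survivedC",
  aggfunc = "mean")

print(via_crosstab)
print(via_pivot)
```

Both give the mean fare for each port and survival status combination. Which one to use is mostly a matter of taste.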


---

# Quick Video

This video shows an example of reading in some data and finding numeric summaries!

We'll look at the `.pivot_table()` and `.agg()` methods.

Remember to pop the video out into the full player.

The notebook written in the video is [available here](https://colab.research.google.com/github/jbpost2/ST-554-Big-Data-with-Python/blob/main/01_Programming_in_python/Learning_Python.ipynb).

In [79]:
from IPython.display import IFrame
IFrame(src="https://ncsu.hosted.panopto.com/Panopto/Pages/Embed.aspx?id=197e5481-40be-4488-8271-b0ff000f5fd2&autoplay=false&offerviewer=true&showtitle=true&showbrand=true&captions=false&interactivity=all", height="405", width="720")

---

# Recap

EDA is often the first step to an analysis:

- Must understand the type of data you have/missingness/data validation
- Then describe the distributions of the variables
- Numerical summaries

    + Contingency Tables: `pd.crosstab()`  
    + Mean/Median: `.mean()`, `.median()` methods on a `Series` or data frame
    + Standard Deviation/quantiles: `.std()`, `.quantile()`  methods

- Across subgroups with the `.groupby()` method or `pd.crosstab()` with the `values` and `aggfunc` arguments

- You can [fancy up output](https://pandas.pydata.org/pandas-docs/stable/user_guide/style.html) too!

That wraps up the material for week 3! Head to the <a href = "https://wolfware.ncsu.edu/" target = "_blank">Moodle site</a> to work on your next homework assignment.

If you are on the course website, use the table of contents on the left or the arrows at the bottom of this page to navigate to the next learning material!

If you are on Google Colab, head back to our course website for [our next lesson](https://jbpost2.github.io/ST-554-Big-Data-with-Python/01_Programming_in_python/18-More_Function_Writing.html)!