# Descriptive Statistics with `Python` Exercises

This is the second exercise for this course. 
As in the prior exercise, the concepts here will be familiar: they mirror what we covered in the labs and practices. 

There are often multiple ways to solve a problem, some more elegant than others. 
Revisit older notebooks for guidance, ask questions along the way, 
and **we encourage you to search the internet**. 
Chances are that someone has already run into the same or a similar question or error.

For this exercise, we will be working with the `NationalNames3.csv` file. As before, it is in the `/dsa/data/all_datasets/baby-names/` directory.

In [1]:
# Adding a cell above to import packages as needed

import pandas as pd
import numpy as np

**Exercise 1**: *Read in the `NationalNames3.csv` file and name it `df`.*

In [2]:
# Exercise 1 code goes here
# -------------------------

# pd.read_csv accepts a file path directly, so no explicit open() is needed
df = pd.read_csv('/dsa/data/all_datasets/baby-names/NationalNames3.csv')

# Show the first 10 rows to confirm the data loaded as expected

df.head(10)


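|   | Id | Name | Year | Gender | Count |
|---|----|------|------|--------|-------|
| 0 | 1 | Mary | 1880 | F | 7065 |
| 1 | 5 | Minnie | 1880 | F | 1746 |
| 2 | 7 | Ida | 1880 | F | 1472 |
| 3 | 9 | Bertha | 1880 | F | 1320 |
| 4 | 15 | Cora | 1880 | F | 1045 |
| 5 | 16 | Martha | 1880 | F | 1040 |
| 6 | 17 | Laura | 1880 | F | 1012 |
| 7 | 19 | Grace | 1880 | F | 982 |
| 8 | 20 | Carrie | 1880 | F | 949 |
| 9 | 27 | Hattie | 1880 | F | 769 |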


We will start out simple.

**Exercise 2**: Find the mean of the `Count` column. How spread out is the data?

In [6]:
# Exercise 2 code goes here
# -------------------------

# Find mean and standard deviation


mean_count = df['Count'].mean()
std_dev_count = df['Count'].std()
min_count = df['Count'].min()
max_count = df['Count'].max()

print("Below are some stats on the 'Count' column of the data frame\n")

print("Count Mean \t\t {}".format(mean_count))
print("Standard Dev \t\t {}".format(std_dev_count))
print("Count Min \t\t {}".format(min_count))
print("Count Max \t\t {}".format(max_count))


Below are some stats on the 'Count' column of the data frame

Count Mean 		 181.71348937247797
Standard Dev 		 1535.8103590811502
Count Min 		 5
Count Max 		 96205
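
The standard deviation above is several times larger than the mean, and the maximum is far above both, which suggests that a few very large counts dominate the spread. As a rough illustration, the sketch below uses a small made-up Series (`toy_counts` is hypothetical, not taken from `NationalNames3.csv`) to compare the mean and standard deviation with the median and interquartile range, which are less sensitive to a handful of extreme values.

```python
import pandas as pd

# Hypothetical toy counts (NOT from NationalNames3.csv): mostly small, one very large
toy_counts = pd.Series([5, 6, 8, 10, 12, 15, 20, 5000])

toy_mean = toy_counts.mean()
toy_std = toy_counts.std()        # sample standard deviation (ddof=1)
toy_median = toy_counts.median()
toy_iqr = toy_counts.quantile(0.75) - toy_counts.quantile(0.25)

print("Mean \t\t {}".format(toy_mean))
print("Standard Dev \t {}".format(toy_std))
print("Median \t\t {}".format(toy_median))
print("IQR \t\t {}".format(toy_iqr))
```

On the real data, the same calls on `df['Count']` would show whether the median tells a different story from the mean.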


**Exercise 3:** Find the minimum, first quartile, median, third quartile and max count of the name "Sam" in this dataset.

In [7]:
# Exercise 3 code goes here
# -------------------------

# Using the .describe method
df[df['Name'] == 'Sam'].describe()


|       | Id | Year | Count |
|-------|----|------|-------|
| count | 86.0 | 86.0 | 86.0 |
| mean | 475666.4 | 1942.837209 | 608.581395 |
| std | 416306.5 | 32.879409 | 661.426755 |
| min | 9351.0 | 1884.0 | 5.0 |
| 25% | 166359.5 | 1919.25 | 16.25 |
| 50% | 388200.5 | 1942.0 | 482.0 |
| 75% | 632005.2 | 1964.25 | 1071.0 |
| max | 1778931.0 | 2013.0 | 2505.0 |


In [12]:
# Same summary as above, but computing each value explicitly with np.quantile.
# The values should match; note that above we used pandas (pd) and here we use NumPy (np).

# Creating a subset of Sam
sam = df[df['Name'] == 'Sam']

min_sam = np.quantile(a=sam['Count'], q=0.0)
first_quartile = np.quantile(a=sam['Count'], q=0.25)
median_sam = np.quantile(a=sam['Count'], q=0.5)
third_quartile = np.quantile(a=sam['Count'], q=0.75)
max_sam = np.quantile(a=sam['Count'], q=1)

print("Min count of Sam \t\t {}".format(min_sam))
print("First Quart count of Sam \t {}".format(first_quartile))
print("Median count of Sam \t\t {}".format(median_sam))
print("Third Quart count of Sam \t {}".format(third_quartile))
print("Max count of Sam \t\t {}".format(max_sam))

Min count of Sam 		 5.0
First Quart count of Sam 	 16.25
Median count of Sam 		 482.0
Third Quart count of Sam 	 1071.0
Max count of Sam 		 2505
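
The two approaches agree because pandas and NumPy both use linear interpolation between neighbouring values by default. A minimal sketch of that check, using a made-up Series (`toy` is hypothetical, not the real "Sam" subset) and passing all five quantiles in one call:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (NOT the real "Sam" counts)
toy = pd.Series([5, 12, 30, 480, 1100, 2500])

# Minimum, first quartile, median, third quartile, maximum
qs = [0.0, 0.25, 0.5, 0.75, 1.0]

pd_summary = toy.quantile(qs)      # pandas: returns a Series indexed by q
np_summary = np.quantile(toy, qs)  # NumPy: returns an array in the same order

print(pd_summary)
print("pandas and NumPy agree: {}".format(np.allclose(pd_summary.values, np_summary)))
```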


**Challenge Exercise 1**: On average (use the median), are there more female names per year or male names? Remember, you will have to tally the names per year for each gender.

In [22]:
# Challenge Exercise 1 code goes here
# -----------------------------------

# subset by gender
females = df[df['Gender'] == 'F']
males = df[df['Gender'] == 'M']

female_median_names = np.median(females.groupby('Year').count()['Name'])
male_median_names = np.median(males.groupby('Year').count()['Name'])

print("Female Median Names per year: \t{}".format(female_median_names))
print("Male Median Names per year: \t{}".format(male_median_names))

# Based on the output below, the median number of names per year is higher for females than for males

Female Median Names per year: 	2007.0
Male Median Names per year: 	1516.0
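
One possible alternative to subsetting each gender separately is to group by both columns at once and then take the median within each gender. The sketch below assumes the same column names as `df` but runs on a tiny made-up frame (`toy` and its rows are hypothetical), so the numbers are illustrative only.

```python
import pandas as pd

# Hypothetical toy frame with the same column names as df (NOT real data)
toy = pd.DataFrame({
    "Name":   ["Mary", "Anna", "John", "Mary", "Anna", "Emma", "John", "Will"],
    "Year":   [1880, 1880, 1880, 1881, 1881, 1881, 1881, 1881],
    "Gender": ["F", "F", "M", "F", "F", "F", "M", "M"],
})

# Number of name rows per (Gender, Year) pair
names_per_year = toy.groupby(["Gender", "Year"])["Name"].count()

# Median of those yearly tallies within each gender
median_by_gender = names_per_year.groupby(level="Gender").median()

print(median_by_gender)
```

Grouping once keeps the two genders in a single result, which makes the comparison easier to read when there are more than two groups.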


**Exercise 4**: For the name "Margaret" across all years, 65% of the `Count` values are at or below what value? 


In [23]:
# Exercise 4 code goes here
# -------------------------

margarets = df[df['Name'] == 'Margaret']

print("65% of the Margaret counts are at or below {}".format(np.quantile(a=margarets['Count'], q=0.65)))


65% of the Margaret counts are at or below 3256.4000000000005
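
The result is not a whole number because the default quantile calculation interpolates linearly between the two nearest observations. Here is a small sketch of what that means, using a made-up Series (`toy` is hypothetical) and the `interpolation` argument of pandas' `Series.quantile` to contrast the default with picking an actual observed value:

```python
import pandas as pd

# Hypothetical toy counts (NOT the real "Margaret" data)
toy = pd.Series([10, 20, 30, 40])

# Default: interpolate linearly between the two neighbouring observations
q65_linear = toy.quantile(0.65)

# Alternatives: snap to the nearest observed value below or above that position
q65_lower = toy.quantile(0.65, interpolation="lower")
q65_higher = toy.quantile(0.65, interpolation="higher")

print("Linear \t {}".format(q65_linear))
print("Lower \t {}".format(q65_lower))
print("Higher \t {}".format(q65_higher))
```

Which variant is appropriate depends on whether the answer needs to be a count that actually occurs in the data.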


**Exercise 5**: Find the covariance and correlation between the Year and Count of the name "Addison".

In [26]:
# Exercise 5 code goes here
# -------------------------

addisons = df[df['Name'] == 'Addison']

addi_cov = addisons.Year.cov(addisons.Count)
addi_corr = addisons.Year.corr(addisons.Count)

print("Covariance between Year and Count of 'Addison': \t{}".format(addi_cov))
print("Correlation between Year and Count of 'Addison': \t{}".format(addi_corr))

Covariance between Year and Count of 'Addison': 	48944.459016393455
Correlation between Year and Count of 'Addison': 	0.4259311439218686
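
Covariance depends on the units of both variables, which is why it is so much larger than the correlation here. Correlation is the covariance divided by the product of the two standard deviations, so it always falls between -1 and 1. A minimal sketch of that relationship on a made-up frame (`toy` is hypothetical, not the real "Addison" subset):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (NOT the real "Addison" rows)
toy = pd.DataFrame({
    "Year":  [2000, 2001, 2002, 2003, 2004],
    "Count": [10, 14, 13, 30, 45],
})

cov_xy = toy["Year"].cov(toy["Count"])
corr_xy = toy["Year"].corr(toy["Count"])

# Rebuild the correlation by hand: cov(x, y) / (std(x) * std(y))
corr_from_cov = cov_xy / (toy["Year"].std() * toy["Count"].std())

print("Covariance \t\t {}".format(cov_xy))
print("Correlation \t\t {}".format(corr_xy))
print("Cov / (std * std) \t {}".format(corr_from_cov))
```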


**Challenge Exercise 2**: Does accounting for Gender affect the strength of the linear relationship found in Exercise 5? Does one gender have a stronger linear relationship than the other? 


In [28]:
# Challenge Exercise 2 code goes here
# -----------------------------------

addi_f = addisons[addisons['Gender'] == 'F']
addi_m = addisons[addisons['Gender'] == 'M']

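print("Female correlation between Year and Count of 'Addison': \t {}".format(addi_f.Year.corr(addi_f.Count)))
print("Male correlation between Year and Count of 'Addison': \t\t {}".format(addi_m.Year.corr(addi_m.Count)))

# Based on the output below, splitting by gender strengthens the linear relationship
# (about 0.83 for F and 0.67 for M, versus about 0.43 combined), and it is stronger for females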

Female correlation between Year and Count of 'Addison': 	 0.8318457439323934
Male correlation between Year and Count of 'Addison': 		 0.6712916451351595


# Save your notebook, then `File > Close and Halt`