# Data Science with `Python` Practice

This is your first practice notebook. The purpose of these practices is to reinforce some of the content covered in the lab and to introduce some new material with a bit of guidance along the way. Unlike the labs, these notebooks are incomplete: you will be actively editing and writing code to modify or produce output. The skeleton is already here, but throughout these practice notebooks we will ask you to fill in the rest. In doing so, you will hone your data science techniques and learn how to search for solutions to your programming hurdles.

This practice goes over the fundamentals of data science with Python. Much of the content is similar to your lab, [Introduction to Data Science with Python](../labs/intro_data_science_python.ipynb), so the lab will serve as a good guidepost for answering some of the questions. We'll begin today by reading in the data...

## Read in Data 
For this practice we will be using a different baby names dataset.

We want to read in the data without using any libraries.

In [2]:
with open('/dsa/data/all_datasets/baby-names/NationalNames2.csv', 'r') as file:
    data = file.read()
    print(repr(data[0:101]))


'Id,Name,Year,Gender,Count\n4,Elizabeth,1880,F,1939\n8,Alice,1880,F,1414\n12,Clara,1880,F,1226\n13,Ella,18'


Currently, we can only use indexes to locate specific characters in the data, which includes unwanted characters such as commas and newline characters. In other words, all of the data are stored in a single string, which is not very useful.

**Activity 1**: *Read in the file so that it is a `list of lists`. In other words, I should be able to access each row individually as well as individual values within the row.* 

In [3]:
# Code for activity 1 goes here
# -----------------------------
split = data.split('\n')
list_of_lists = []
for line in split:
    row = line.split(',')
    list_of_lists.append(row)
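
As a small illustration of the list-of-lists idea, the sketch below works on a short sample string copied from the first rows printed above rather than on the full file. It assumes the text ends with a newline, in which case `split('\n')` leaves an empty final element; `splitlines()` is one way to avoid that. The names `sample` and `rows` are made up for this example.

```python
# Sample text copied from the first rows shown above (not the full file)
sample = 'Id,Name,Year,Gender,Count\n4,Elizabeth,1880,F,1939\n8,Alice,1880,F,1414\n'

# splitlines() avoids the empty final element that split('\n') produces
# when the text ends with a newline
rows = [line.split(',') for line in sample.splitlines()]

print(rows[1])     # the first data row, as a list of strings
print(rows[1][1])  # the name stored in that row
```

Each row can now be reached with one index and each value within a row with a second index.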


To make sure everyone is working with data loaded the same way, we are going to read in the data using the `csv` library. Remember that there is a lot of data to work with here, so we are going to subset it.

In [4]:
import csv

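# create a list of lists with the csv library and store the data in a `data_list` variable
# (the `with` block closes the file once the rows have been read)
with open('/dsa/data/all_datasets/baby-names/NationalNames2.csv', newline='') as file:
    data_list = list(csv.reader(file, delimiter=','))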

# create a subset of the entire data set to speed things up
subset = data_list[1:301]
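
To see why the slice starts at index 1, here is a minimal sketch that runs `csv.reader` on a few sample lines instead of the real file. `io.StringIO` stands in for the open file, and `sample`, `sample_rows`, `header`, and `body` are names invented for this example.

```python
import csv
import io

# A few lines copied from the output above; io.StringIO stands in for the file
sample = ('Id,Name,Year,Gender,Count\n'
          '4,Elizabeth,1880,F,1939\n'
          '8,Alice,1880,F,1414\n'
          '12,Clara,1880,F,1226\n')

with io.StringIO(sample) as f:
    sample_rows = list(csv.reader(f))

header = sample_rows[0]   # index 0 holds the column names
body = sample_rows[1:3]   # start at 1 to skip the header, stop before 3

print(header)
print(body)
```

Note that every value comes back as a string, including the numeric-looking `Count` field.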

Take the following scenario:

Imagine that we want to find the names in the data set that are not that common. Let's classify names with a `Count` less than 30 as not that popular. The code below is almost what we need to find all of the rows with a count below 30, but it raises an error.

**Activity 2**: *In the second cell below, correct (debug) the following code and answer the questions.*

In [6]:
for row in subset:
    if row[4] < 30:
        print(row[1])

TypeError: '<' not supported between instances of 'str' and 'int'

**Questions**:
1. What does the error above mean?
2. How would you correct the code so that the names given to fewer than 30 people are `print`ed out?

In [10]:
# Code for activity 2 goes here
# -----------------------------
# 1. Answer the question here as a comment
print('It means that you cannot use >, <, to compare a string (how the data has been stored) to an integer')

# 2. Below put the corrected code
for row in subset:
    if int(row[4]) < 30:
        print(row[1])

It means that you cannot use >, <, to compare a string (how the data has been stored) to an integer
Lucretia
Orpha
Alvina
Catharine
Elma
Geneva
Lona
Linda
Zula
Frieda
Joanna
Tennie
Ettie
Letha
Minta
Adah
Margret
Floy
Idella
Juanita
Isabell
Pattie
Vivian
Almeda
Jannie
Kathrine
Lavinia
Susanna
Elsa
Gladys
Vesta
Antoinette
Libbie
Lilian
Lutie
Meda
Zelma
Adelia
Annetta
Antonia
Dona
Iona
Alva
Cecile
Ellie
Evie
Frankie
Helene
Minna
Savannah
Tina
Anita
Dorothea
Nan
Pearlie
Constance
Ila
Jimmie
Lucia
Ludie
Betsy
Hortense
June
Mona
Cathrine
Clyde
Eleanore
Fay
Jenny
Peggy
Abigail
Clemmie
Easter
Emelia
India
Lotta
Mame
Aline
Emmer
Lissie
Mallie
Malvina
Mazie
Robert
Rosina
Theodora
Therese
Altha
Birtie
Claude
Emelie
Erna
Hilma
Juliet
Leonie
Lugenia
Manda
Manerva
Nella
Paulina
Philomena
Sena
Althea
Annabelle
Dell
Dellar
Elinor
Ione
Josiephine
Lavina
Marcia
Margarette
Oda
Patty
Rosalia
Roxanna
Sula
Winnifred
Bernadette
Elena
Elenora
Inga
Kattie
Leslie
Margery
Ocie
Rowena
Shirley
Tabitha
Verdie
Alber
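
The error and its fix can be reproduced on a single row. This is only a sketch: the row below is hypothetical (its `Id` value is made up, the other fields are taken from the output above), and `message` and `is_uncommon` are names introduced for the example.

```python
# Hypothetical row in the same shape as the rows in `subset`;
# the Id value '0' is made up, the other fields come from the output above
row = ['0', 'Lucretia', '1880', 'F', '29']

message = None
try:
    row[4] < 30          # comparing a str with an int is not defined
except TypeError as err:
    message = str(err)
    print(message)

# Converting the field to int makes the comparison valid
is_uncommon = int(row[4]) < 30
print(is_uncommon)
```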

## Data Manipulation with `pandas`

We are going to transition to using `pandas` now. Let's begin by reading in the file...

In [11]:
import pandas as pd

df = pd.read_csv('/dsa/data/all_datasets/baby-names/NationalNames2.csv')

In [12]:
# display the first 10 rows
df.head(10)

Unnamed: 0,Id,Name,Year,Gender,Count
0,4,Elizabeth,1880,F,1939
1,8,Alice,1880,F,1414
2,12,Clara,1880,F,1226
3,13,Ella,1880,F,1156
4,18,Nellie,1880,F,995
5,21,Maude,1880,F,858
6,22,Mabel,1880,F,808
7,23,Bessie,1880,F,796
8,24,Jennie,1880,F,793
9,25,Gertrude,1880,F,787


So this looks good, but the `Id` column from the original file is redundant because `pandas` provides our data frame with one already. 

**Activity 3**: *Remove the `Id` column upon reading in the data.*

In [14]:
# Code for activity 3 goes here
# -----------------------------
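# Drop `Id` as part of the read by chaining `drop` onto `read_csv`
# (`columns=` replaces the positional axis argument, which newer pandas versions reject)
df = pd.read_csv('/dsa/data/all_datasets/baby-names/NationalNames2.csv').drop(columns='Id')

df.head()

# Alternative when the data frame was read in with `Id` still present
# (running it after the drop above would raise a KeyError):
# del df['Id']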



We now want to subset the data frame to only display rows for female names. Remember, here is how we do that in `pandas`. 

In [15]:
females = df[df['Gender'] == 'F']

Remember though, we are trying to find names that are not very common.

**Activity 4**: *From this subset of female names, return a data frame with the names whose count is less than 30. Name this data frame `uncommon_f`.*

In [20]:
# Code for activity 4 goes here 
# -----------------------------
uncommon_f = females[females['Count'] < 30]

print(uncommon_f.head(n = 5))

          Name  Year Gender  Count
108   Lucretia  1880      F     29
109      Orpha  1880      F     29
110     Alvina  1880      F     28
111  Catharine  1880      F     28
112       Elma  1880      F     28
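
It may help to look at what the comparison inside the brackets produces. The sketch below uses a tiny data frame built by hand from four rows that appear in the outputs of this notebook; `sample_df`, `mask`, `sample_females`, and `sample_uncommon_f` are names invented for the example.

```python
import pandas as pd

# Four rows copied from outputs shown in this notebook
sample_df = pd.DataFrame({
    'Name': ['Elizabeth', 'Lucretia', 'Charles', 'Ole'],
    'Year': [1880, 1880, 1880, 1880],
    'Gender': ['F', 'F', 'M', 'M'],
    'Count': [1939, 29, 5348, 29],
})

# The comparison yields one True/False value per row
mask = sample_df['Gender'] == 'F'
print(mask.tolist())

# Indexing with the mask keeps only the rows marked True
sample_females = sample_df[mask]
sample_uncommon_f = sample_females[sample_females['Count'] < 30]
print(sample_uncommon_f)
```

The original index labels are kept after filtering, which is why the output above starts at 108 rather than 0.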


Now let's do something similar for male names, but this time we should include both uncommon and very common names in our subset.

**Activity 5**: *Create a data frame of male names whose count is less than 30 or greater than or equal to 1000. Name this data frame `com_uncom_m`.*

In [25]:
# Code for activity 5 goes here 
# -----------------------------
male = df[df['Gender']== 'M']
com_uncom_m = male[(male['Count'] >= 1000) | (male['Count'] < 30)]
print(com_uncom_m.head(n=5))

         Name  Year Gender  Count
329   Charles  1880      M   5348
330    Thomas  1880      M   2534
331    Walter  1880      M   1755
425       Ole  1880      M     29
426  Benjiman  1880      M     28
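
Conditions can also be combined into a single mask with `&` (and) and `|` (or). Each comparison needs its own parentheses because these operators bind more tightly than `<` and `>=`. The sketch below is an illustration on hand-built data: the row named `Example` is hypothetical and exists only to show a count that falls between the two thresholds.

```python
import pandas as pd

sample_df = pd.DataFrame({
    'Name': ['Elizabeth', 'Charles', 'Ole', 'Example'],  # 'Example' is a made-up row
    'Gender': ['F', 'M', 'M', 'M'],
    'Count': [1939, 5348, 29, 500],
})

# male AND (very common OR uncommon), written as one boolean mask
mask = (sample_df['Gender'] == 'M') & (
    (sample_df['Count'] >= 1000) | (sample_df['Count'] < 30)
)
selected = sample_df[mask]
print(selected)
```

Filtering in two steps, as in the cell above, gives the same rows; the single mask just avoids the intermediate data frame.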


We are going to go ahead and do some sorting now. Remember this bit of code from the lab exercises where we sorted the rows by `Count`.

In [26]:
df.sort_values(by = ['Count'], ascending = True).head(10)

Unnamed: 0,Name,Year,Gender,Count
608891,Zyrin,2014,M,5
110071,Rodolphe,1935,M,5
110072,Romulo,1935,M,5
110073,Rosalie,1935,M,5
110074,Rosser,1935,M,5
110075,Rudie,1935,M,5
110076,Saint,1935,M,5
110077,Sherril,1935,M,5
110078,Silberio,1935,M,5
110070,Rochester,1935,M,5


**Activity 6**: *Now sort the data frame, `df`, by `Year` and alphabetically by `Name`.* 

In [27]:
# Code for activity 6 goes here 
# -----------------------------
df.sort_values(['Year','Name'], ascending=[True,True])


Unnamed: 0,Name,Year,Gender,Count
625,Ab,1880,M,5
63,Abbie,1880,F,71
260,Abby,1880,F,6
178,Abigail,1880,F,12
427,Abner,1880,M,27
...,...,...,...,...
608890,Zyran,2014,M,5
607759,Zyrell,2014,M,7
604169,Zyrihanna,2014,F,5
608891,Zyrin,2014,M,5
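
When `by` is a list, rows are ordered by the first column and ties are broken by the next one; `ascending` can take one flag per column. A minimal sketch on five rows copied from outputs in this notebook (the names `sample_df`, `by_year_name`, and `newest_first` are invented here):

```python
import pandas as pd

sample_df = pd.DataFrame({
    'Name': ['Zyrin', 'Abbie', 'James', 'Ab', 'Zyran'],
    'Year': [2014, 1880, 1947, 1880, 2014],
    'Count': [5, 71, 94755, 5, 5],
})

# Sort by Year first, then alphabetically by Name within each year
by_year_name = sample_df.sort_values(['Year', 'Name'], ascending=[True, True])
print(by_year_name)

# One flag per column: newest years first, names still A to Z
newest_first = sample_df.sort_values(['Year', 'Name'], ascending=[False, True])
print(newest_first)
```

`sort_values` returns a new data frame and leaves `sample_df` unchanged, so assign the result to a name if you want to keep it.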


Below is one way to find the most popular name in the entire data set, measured by raw count.

In [30]:
df.sort_values(by = ['Count'], ascending = True).tail(1)

Unnamed: 0,Name,Year,Gender,Count
145632,James,1947,M,94755
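
Sorting the whole data frame is not the only option. As a sketch of two alternatives, `idxmax` returns the index label of the largest value and `nlargest` returns the top rows directly. The data frame below is built by hand from three 1880 rows shown earlier, so the results apply to this sample only.

```python
import pandas as pd

# Three rows copied from outputs shown earlier in this notebook
sample_df = pd.DataFrame({
    'Name': ['Elizabeth', 'Alice', 'Charles'],
    'Year': [1880, 1880, 1880],
    'Gender': ['F', 'F', 'M'],
    'Count': [1939, 1414, 5348],
})

# Row with the largest Count, looked up by its index label
top_overall = sample_df.loc[sample_df['Count'].idxmax()]

# The same row returned as a one-row data frame
top_overall_alt = sample_df.nlargest(1, 'Count')

# Narrow the rows first to answer a more specific question
females_1880 = sample_df[(sample_df['Year'] == 1880) & (sample_df['Gender'] == 'F')]
top_female = females_1880.loc[females_1880['Count'].idxmax()]

print(top_overall['Name'], top_female['Name'])
```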


But what if we were interested in something a bit more specific? Perhaps, the most popular name during a given year.

**Activity 7**: *Find the most popular female name in the year 1881.*

In [35]:
# Code for activity 7 goes here 
# -----------------------------
# keep only female names from 1881; without the Gender filter the top row is a male name
year1_f = df[(df['Year'] == 1881) & (df['Gender'] == 'F')]
print(year1_f.sort_values(by = ['Count'], ascending = True).tail(1))


This final practice exercise is going to be a challenge. Challenge exercises are meant to encourage you to expand on what you have already learned and to search for answers that we may not have explicitly gone over.

Imagine that we wanted to find only the names starting with a certain letter.

**Activity 8**: *Create a subset of names from the data set that start with the letter "E". Name this data frame `starts_with_e`.*

In [38]:
# Code for activity 8 goes here 
# -----------------------------
starts_with_e = df[df['Name'].str.match('E')]
print(starts_with_e.head(5))

         Name  Year Gender  Count
0   Elizabeth  1880      F   1939
3        Ella  1880      F   1156
14      Ethel  1880      F    633
16        Eva  1880      F    614
23      Effie  1880      F    406
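
`str.match` works here because it only matches at the beginning of each string. `str.startswith` is another way to express the same test and may read more directly. The sketch below compares the two on a handful of names taken from the outputs above; `names` is a name invented for the example.

```python
import pandas as pd

names = pd.Series(['Elizabeth', 'Alice', 'Ella', 'Mabel', 'Ethel'])

# Both tests are case-sensitive and look only at the start of each name
with_match = names.str.match('E')
with_startswith = names.str.startswith('E')

print(names[with_startswith].tolist())
```

A name such as `Mabel` contains a lowercase "e" but is not selected, since neither test looks past the first character.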


# Save your notebook, then `File > Close and Halt`