# Data Science with `Python` Practice

This is your first practice notebook. The purpose of these practices is to reiterate some of the content that you went over in the lab, as well as to introduce some new material with a bit of guidance along the way. Unlike the labs, these notebooks are incomplete: you will actively edit and write code to modify or produce output. The skeleton is already here, and throughout these practice notebooks we will ask you to fill in the rest. In doing so, you will hone your data science techniques and learn how to search for solutions to your programming hurdles.

This practice goes over the fundamentals of data science with Python. Much of the content is similar to your lab, [Introduction to Data Science with Python](../labs/intro_data_science_python.ipynb), so the lab will serve as a good guidepost for answering some of the questions. We'll begin today with reading in the data...

## Read in Data 
For this practice we will be using a different baby names dataset.

We want to read in the data without using any libraries.

In [1]:
with open('/dsa/data/all_datasets/baby-names/NationalNames2.csv', 'r') as file:
    data = file.read()
    print(repr(data[0:101]))


'Id,Name,Year,Gender,Count\n4,Elizabeth,1880,F,1939\n8,Alice,1880,F,1414\n12,Clara,1880,F,1226\n13,Ella,18'


Currently, we are only able to use the indexes to locate specific characters in all of the data; this includes some unwanted characters such as commas and new line characters. In other words, all of the data are stored in a single string which is not very useful.

**Activity 1**: *Read in the file so that it is a `list of lists`. In other words, I should be able to access each row individually as well as individual values within the row.* 

In [3]:
# Code for activity 1 goes here
# -----------------------------
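with open('/dsa/data/all_datasets/baby-names/NationalNames2.csv', 'r') as file:
    data = file.read()

# Split the text into lines; splitlines() avoids an empty last row when the file ends with a newline
data_lines = data.splitlines()

# Split each line on commas to get one list per row
data_lol = []
for line in data_lines:
    row = line.split(',')
    data_lol.append(row)

print(data_lol[0:10])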


[['Id', 'Name', 'Year', 'Gender', 'Count'], ['4', 'Elizabeth', '1880', 'F', '1939'], ['8', 'Alice', '1880', 'F', '1414'], ['12', 'Clara', '1880', 'F', '1226'], ['13', 'Ella', '1880', 'F', '1156'], ['18', 'Nellie', '1880', 'F', '995'], ['21', 'Maude', '1880', 'F', '858'], ['22', 'Mabel', '1880', 'F', '808'], ['23', 'Bessie', '1880', 'F', '796'], ['24', 'Jennie', '1880', 'F', '793']]


To make sure everyone is working with the data loaded the same way, we are going to read in the data using the `csv` library. Remember that there is a lot of data to work with here, so we are also going to subset it.

In [4]:
import csv

# create a list of lists with the csv library and store the data in a `data_list` variable
# (the `with` block closes the file once it has been read)
with open('/dsa/data/all_datasets/baby-names/NationalNames2.csv', newline='') as file:
    data_list = list(csv.reader(file, delimiter=','))

# create a subset of the entire data set to speed things up (row 0 is the header, so start at 1)
subset = data_list[1:301]
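
One reason to prefer the `csv` library over splitting on commas by hand is quoting. The sketch below uses a made-up three-line sample (not taken from the baby names file) in which one field contains a comma inside quotes, and compares the two approaches.

```python
import csv
import io

# Hypothetical sample: the second data row has a comma inside a quoted field.
sample = 'Id,Name,Count\n1,Alice,10\n2,"Smith, Jr.",3\n'

# Manual approach: split into lines, then split each line on commas.
naive_rows = [line.split(',') for line in sample.splitlines()]

# csv.reader understands quoting, so the quoted comma stays inside one field.
csv_rows = list(csv.reader(io.StringIO(sample)))

print(naive_rows[2])  # ['2', '"Smith', ' Jr."', '3']
print(csv_rows[2])    # ['2', 'Smith, Jr.', '3']
```

For a file with no quoted fields, both approaches should give the same rows, which is why the manual version works for Activity 1.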

In [9]:
len(data_list)
# Out of curiosity

608893

In [10]:
len(subset)

300

In [12]:
#just printing to see a sample 
subset

[['4', 'Elizabeth', '1880', 'F', '1939'],
 ['8', 'Alice', '1880', 'F', '1414'],
 ['12', 'Clara', '1880', 'F', '1226'],
 ['13', 'Ella', '1880', 'F', '1156'],
 ['18', 'Nellie', '1880', 'F', '995'],
 ['21', 'Maude', '1880', 'F', '858'],
 ['22', 'Mabel', '1880', 'F', '808'],
 ['23', 'Bessie', '1880', 'F', '796'],
 ['24', 'Jennie', '1880', 'F', '793'],
 ['25', 'Gertrude', '1880', 'F', '787'],
 ['29', 'Mattie', '1880', 'F', '704'],
 ['31', 'Catherine', '1880', 'F', '688'],
 ['35', 'Helen', '1880', 'F', '636'],
 ['37', 'Louise', '1880', 'F', '635'],
 ['38', 'Ethel', '1880', 'F', '633'],
 ['39', 'Lula', '1880', 'F', '621'],
 ['41', 'Eva', '1880', 'F', '614'],
 ['42', 'Frances', '1880', 'F', '605'],
 ['43', 'Lena', '1880', 'F', '603'],
 ['46', 'Maggie', '1880', 'F', '582'],
 ['48', 'Daisy', '1880', 'F', '564'],
 ['50', 'Josephine', '1880', 'F', '544'],
 ['54', 'Agnes', '1880', 'F', '473'],
 ['63', 'Effie', '1880', 'F', '406'],
 ['65', 'Nettie', '1880', 'F', '403'],
 ['70', 'Maud', '1880', 'F', 

Take the following scenario:

Imagine that we want to find the names in the data set that are not that common. Let's classify names that have a `Count` less than 30 as not that popular. The code below is almost what we need to find all of the rows with a `Count` less than 30, but it raises an error.

**Activity 2**: *In the second cell below, correct (debug) the code and answer the following questions.*

In [11]:
for row in subset:
    if row[4] < 30:
        print(row[1])

TypeError: '<' not supported between instances of 'str' and 'int'

**Questions**:
1. What does the error above mean?
2. How would you correct the code so that the names with a `Count` less than 30 are `print`ed out?

In [14]:
# Code for activity 2 goes here
# -----------------------------
# 1. Answer the question here as a comment
# The values read from the file are strings, so `row[4] < 30` compares a str with an int, which Python does not allow.
# To make this work, we need to convert the Count column (index 4, the fifth column) to a number before comparing.
# 2. Below put the corrected code
counter = 0

for row in subset:
    if float(row[4]) < 30:
        print(row[1])
        counter += 1

print("The number of names with a count less than 30 is {}".format(counter))
# There are many ways to accomplish this task depending on needs. Converting inside the comparison is simple, but the
# stored values remain strings, so any later analysis has to convert them again. Ideally we would coerce the values
# in row[4] to ints/floats once, or create a new column holding the numeric values.

# An alternative that changes the type of the stored values follows in the next cell.


Lucretia
Orpha
Alvina
Catharine
Elma
Geneva
Lona
Linda
Zula
Frieda
Joanna
Tennie
Ettie
Letha
Minta
Adah
Margret
Floy
Idella
Juanita
Isabell
Pattie
Vivian
Almeda
Jannie
Kathrine
Lavinia
Susanna
Elsa
Gladys
Vesta
Antoinette
Libbie
Lilian
Lutie
Meda
Zelma
Adelia
Annetta
Antonia
Dona
Iona
Alva
Cecile
Ellie
Evie
Frankie
Helene
Minna
Savannah
Tina
Anita
Dorothea
Nan
Pearlie
Constance
Ila
Jimmie
Lucia
Ludie
Betsy
Hortense
June
Mona
Cathrine
Clyde
Eleanore
Fay
Jenny
Peggy
Abigail
Clemmie
Easter
Emelia
India
Lotta
Mame
Aline
Emmer
Lissie
Mallie
Malvina
Mazie
Robert
Rosina
Theodora
Therese
Altha
Birtie
Claude
Emelie
Erna
Hilma
Juliet
Leonie
Lugenia
Manda
Manerva
Nella
Paulina
Philomena
Sena
Althea
Annabelle
Dell
Dellar
Elinor
Ione
Josiephine
Lavina
Marcia
Margarette
Oda
Patty
Rosalia
Roxanna
Sula
Winnifred
Bernadette
Elena
Elenora
Inga
Kattie
Leslie
Margery
Ocie
Rowena
Shirley
Tabitha
Verdie
Albertina
Albina
Alyce
Annis
Doshie
Etna
Eve
Florance
Geraldine
Gina
Grayce
Jossie
Katheryn
Lea
Leanna
Le

In [17]:
# This code changes row[4] elements to floats.
for row in subset:
    row[4] = float(row[4])
    if row[4] < 30:
        print(row[1])

# Looking at the subset, we can see that the elements in the last column are now numeric floats
subset

Lucretia
Orpha
Alvina
Catharine
Elma
Geneva
Lona
Linda
Zula
Frieda
Joanna
Tennie
Ettie
Letha
Minta
Adah
Margret
Floy
Idella
Juanita
Isabell
Pattie
Vivian
Almeda
Jannie
Kathrine
Lavinia
Susanna
Elsa
Gladys
Vesta
Antoinette
Libbie
Lilian
Lutie
Meda
Zelma
Adelia
Annetta
Antonia
Dona
Iona
Alva
Cecile
Ellie
Evie
Frankie
Helene
Minna
Savannah
Tina
Anita
Dorothea
Nan
Pearlie
Constance
Ila
Jimmie
Lucia
Ludie
Betsy
Hortense
June
Mona
Cathrine
Clyde
Eleanore
Fay
Jenny
Peggy
Abigail
Clemmie
Easter
Emelia
India
Lotta
Mame
Aline
Emmer
Lissie
Mallie
Malvina
Mazie
Robert
Rosina
Theodora
Therese
Altha
Birtie
Claude
Emelie
Erna
Hilma
Juliet
Leonie
Lugenia
Manda
Manerva
Nella
Paulina
Philomena
Sena
Althea
Annabelle
Dell
Dellar
Elinor
Ione
Josiephine
Lavina
Marcia
Margarette
Oda
Patty
Rosalia
Roxanna
Sula
Winnifred
Bernadette
Elena
Elenora
Inga
Kattie
Leslie
Margery
Ocie
Rowena
Shirley
Tabitha
Verdie
Albertina
Albina
Alyce
Annis
Doshie
Etna
Eve
Florance
Geraldine
Gina
Grayce
Jossie
Katheryn
Lea
Leanna
Le

[['4', 'Elizabeth', '1880', 'F', 1939.0],
 ['8', 'Alice', '1880', 'F', 1414.0],
 ['12', 'Clara', '1880', 'F', 1226.0],
 ['13', 'Ella', '1880', 'F', 1156.0],
 ['18', 'Nellie', '1880', 'F', 995.0],
 ['21', 'Maude', '1880', 'F', 858.0],
 ['22', 'Mabel', '1880', 'F', 808.0],
 ['23', 'Bessie', '1880', 'F', 796.0],
 ['24', 'Jennie', '1880', 'F', 793.0],
 ['25', 'Gertrude', '1880', 'F', 787.0],
 ['29', 'Mattie', '1880', 'F', 704.0],
 ['31', 'Catherine', '1880', 'F', 688.0],
 ['35', 'Helen', '1880', 'F', 636.0],
 ['37', 'Louise', '1880', 'F', 635.0],
 ['38', 'Ethel', '1880', 'F', 633.0],
 ['39', 'Lula', '1880', 'F', 621.0],
 ['41', 'Eva', '1880', 'F', 614.0],
 ['42', 'Frances', '1880', 'F', 605.0],
 ['43', 'Lena', '1880', 'F', 603.0],
 ['46', 'Maggie', '1880', 'F', 582.0],
 ['48', 'Daisy', '1880', 'F', 564.0],
 ['50', 'Josephine', '1880', 'F', 544.0],
 ['54', 'Agnes', '1880', 'F', 473.0],
 ['63', 'Effie', '1880', 'F', 406.0],
 ['65', 'Nettie', '1880', 'F', 403.0],
 ['70', 'Maud', '1880', 'F', 
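
The loop above overwrites `row[4]` in place, so `subset` itself changes. If the original strings should stay untouched, one option is to build a new list of typed rows. This is a minimal sketch on a few hypothetical sample rows in the same shape as `subset`; the names and counts appear in this notebook, but the `Id` values for the last two rows are placeholders.

```python
# Hypothetical sample rows shaped like `subset` (every value is a string).
# The Id values '108' and '110' are placeholders, not taken from the file.
sample_rows = [
    ['4', 'Elizabeth', '1880', 'F', '1939'],
    ['108', 'Lucretia', '1880', 'F', '29'],
    ['110', 'Alvina', '1880', 'F', '28'],
]

# Build new rows with Year and Count as ints; the original rows are not modified.
typed_rows = [
    [row[0], row[1], int(row[2]), row[3], int(row[4])]
    for row in sample_rows
]

# The comparison now works without any conversion inside the condition.
uncommon = [row[1] for row in typed_rows if row[4] < 30]
print(uncommon)  # ['Lucretia', 'Alvina']
```

Using `int` rather than `float` keeps the counts as whole numbers, which matches what they represent.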

## Data Manipulation with `pandas`

We are going to transition to using `pandas` now. Let's begin by reading in the file...

In [18]:
import pandas as pd

df = pd.read_csv('/dsa/data/all_datasets/baby-names/NationalNames2.csv')

In [19]:
#displaying the first 10 rows
df.head(10)

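|   | Id | Name | Year | Gender | Count |
|---|---|---|---|---|---|
| 0 | 4 | Elizabeth | 1880 | F | 1939 |
| 1 | 8 | Alice | 1880 | F | 1414 |
| 2 | 12 | Clara | 1880 | F | 1226 |
| 3 | 13 | Ella | 1880 | F | 1156 |
| 4 | 18 | Nellie | 1880 | F | 995 |
| 5 | 21 | Maude | 1880 | F | 858 |
| 6 | 22 | Mabel | 1880 | F | 808 |
| 7 | 23 | Bessie | 1880 | F | 796 |
| 8 | 24 | Jennie | 1880 | F | 793 |
| 9 | 25 | Gertrude | 1880 | F | 787 |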


So this looks good, but the `Id` column from the original file is redundant because `pandas` provides our data frame with one already. 

**Activity 3**: *Remove the `Id` column upon reading in the data.*

In [20]:
# Code for activity 3 goes here
# -----------------------------

del df['Id']

# Note: running this cell a second time raises a KeyError because the 'Id' column is already deleted.

# Display the data frame to confirm the column is gone
df


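|   | Name | Year | Gender | Count |
|---|---|---|---|---|
| 0 | Elizabeth | 1880 | F | 1939 |
| 1 | Alice | 1880 | F | 1414 |
| 2 | Clara | 1880 | F | 1226 |
| 3 | Ella | 1880 | F | 1156 |
| 4 | Nellie | 1880 | F | 995 |
| ... | ... | ... | ... | ... |
| 608887 | Zirui | 2014 | M | 5 |
| 608888 | Zo | 2014 | M | 5 |
| 608889 | Zyel | 2014 | M | 5 |
| 608890 | Zyran | 2014 | M | 5 |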
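
The activity asks for the column to be removed upon reading in the data. One way to do that, sketched here under the assumption that the file has exactly the five columns shown earlier, is the `usecols` parameter of `pd.read_csv`. The example reads from an in-memory stand-in (`sample_csv`, a hypothetical name) instead of the real file so that it runs anywhere.

```python
import io
import pandas as pd

# Stand-in for the CSV file: the first rows are copied from the output shown earlier.
sample_csv = io.StringIO(
    "Id,Name,Year,Gender,Count\n"
    "4,Elizabeth,1880,F,1939\n"
    "8,Alice,1880,F,1414\n"
    "12,Clara,1880,F,1226\n"
)

# usecols tells read_csv which columns to keep, so Id is never loaded.
sample_df = pd.read_csv(sample_csv, usecols=['Name', 'Year', 'Gender', 'Count'])

print(sample_df.columns.tolist())  # ['Name', 'Year', 'Gender', 'Count']
```

With the real file, the same `usecols` argument would be passed alongside the file path. Unlike `del`, this version can be re-run without raising an error.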


We now want to subset the data frame to only display rows for female names. Remember, here is how we do that in `pandas`. 

In [21]:
females = df[df['Gender'] == 'F']

Remember though, we are trying to find names that are not very common.

**Activity 4**: *From this subset of female names, return a data frame with the names whose count is less than 30. Name this data frame `uncommon_f`.*

In [22]:
# Code for activity 4 goes here 
# -----------------------------

uncommon_f = females[females['Count'] < 30]

uncommon_f.head(10)


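|   | Name | Year | Gender | Count |
|---|---|---|---|---|
| 108 | Lucretia | 1880 | F | 29 |
| 109 | Orpha | 1880 | F | 29 |
| 110 | Alvina | 1880 | F | 28 |
| 111 | Catharine | 1880 | F | 28 |
| 112 | Elma | 1880 | F | 28 |
| 113 | Geneva | 1880 | F | 28 |
| 114 | Lona | 1880 | F | 28 |
| 115 | Linda | 1880 | F | 27 |
| 116 | Zula | 1880 | F | 27 |
| 117 | Frieda | 1880 | F | 26 |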


Now let's do something similar for male names, but this time we should include both uncommon and very common names in our subset.

**Activity 5**: *Create a data frame of male names whose count is less than 30 or greater than or equal to 1000. Name this data frame `com_uncom_m`.*

In [25]:
# Code for activity 5 goes here 
# -----------------------------

# Combine all the conditions in one expression. `&` binds more tightly than `|`, so the two Count
# conditions need their own parentheses; without them, female names with a Count above 1000 are kept too.
com_uncom_m = df[(df['Gender'] == 'M') & ((df['Count'] < 30) | (df['Count'] >= 1000))]

# Let's look at head and tail to make sure both were captured:
# the head shows very common names and the tail shows uncommon ones
print(com_uncom_m.head(3))
print(com_uncom_m.tail(10))

        Name  Year Gender  Count
329  Charles  1880      M   5348
330   Thomas  1880      M   2534
331   Walter  1880      M   1755
            Name  Year Gender  Count
608882   Zaymere  2014      M      5
608883  Zekeriah  2014      M      5
608884     Zenas  2014      M      5
608885     Ziion  2014      M      5
608886     Zijun  2014      M      5
608887     Zirui  2014      M      5
608888        Zo  2014      M      5
608889      Zyel  2014      M      5
608890     Zyran  2014      M      5
608891     Zyrin  2014      M      5
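
When a filter mixes `&` and `|`, grouping matters: Python evaluates `A & B | C` as `(A & B) | C`. The sketch below shows the difference on a hypothetical four-row frame (`demo`) built from names and counts that appear in this notebook.

```python
import pandas as pd

# Small hypothetical frame; the rows reuse names and counts shown in this notebook.
demo = pd.DataFrame({
    'Name': ['Elizabeth', 'Charles', 'Ole', 'Adam'],
    'Gender': ['F', 'M', 'M', 'M'],
    'Count': [1939, 5348, 29, 104],
})

is_male = demo['Gender'] == 'M'
rare = demo['Count'] < 30
common = demo['Count'] >= 1000

# Without inner parentheses this parses as (is_male & rare) | common.
loose = demo[is_male & rare | common]

# With them, the gender test applies to both count conditions.
strict = demo[is_male & (rare | common)]

print(loose['Name'].tolist())   # ['Elizabeth', 'Charles', 'Ole']
print(strict['Name'].tolist())  # ['Charles', 'Ole']
```

Naming each condition first, as done here, also makes the final expression easier to read than one long line.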


We are going to go ahead and do some sorting now. Remember this bit of code from the lab exercises where we sorted the rows by `Count`.

In [26]:
df.sort_values(by = ['Count'], ascending = True).head(10)

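|   | Name | Year | Gender | Count |
|---|---|---|---|---|
| 608891 | Zyrin | 2014 | M | 5 |
| 110071 | Rodolphe | 1935 | M | 5 |
| 110072 | Romulo | 1935 | M | 5 |
| 110073 | Rosalie | 1935 | M | 5 |
| 110074 | Rosser | 1935 | M | 5 |
| 110075 | Rudie | 1935 | M | 5 |
| 110076 | Saint | 1935 | M | 5 |
| 110077 | Sherril | 1935 | M | 5 |
| 110078 | Silberio | 1935 | M | 5 |
| 110070 | Rochester | 1935 | M | 5 |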


**Activity 6**: *Now sort the data frame, `df`, by `Year` and alphabetically by `Name`.* 

In [27]:
# Code for activity 6 goes here 
# -----------------------------

df.sort_values(by = ['Year', 'Name'], ascending=True).head(10)


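|   | Name | Year | Gender | Count |
|---|---|---|---|---|
| 625 | Ab | 1880 | M | 5 |
| 63 | Abbie | 1880 | F | 71 |
| 260 | Abby | 1880 | F | 6 |
| 178 | Abigail | 1880 | F | 12 |
| 427 | Abner | 1880 | M | 27 |
| 452 | Abram | 1880 | M | 21 |
| 123 | Adah | 1880 | F | 24 |
| 363 | Adam | 1880 | M | 104 |
| 551 | Addie | 1880 | M | 8 |
| 87 | Adele | 1880 | F | 41 |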


In [29]:
# Change ascending just to see

df.sort_values(by = ['Year', 'Name'], ascending=False).head(10)

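|   | Name | Year | Gender | Count |
|---|---|---|---|---|
| 608259 | Zyshawn | 2014 | M | 6 |
| 608891 | Zyrin | 2014 | M | 5 |
| 604169 | Zyrihanna | 2014 | F | 5 |
| 607759 | Zyrell | 2014 | M | 7 |
| 608890 | Zyran | 2014 | M | 5 |
| 602672 | Zyonnah | 2014 | F | 7 |
| 606482 | Zymire | 2014 | M | 12 |
| 604937 | Zymir | 2014 | M | 55 |
| 602671 | Zyleigh | 2014 | F | 7 |
| 601179 | Zykira | 2014 | F | 11 |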
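
The `ascending` argument of `sort_values` also accepts a list with one flag per sort column, so each column can have its own direction. A minimal sketch on a hypothetical four-row frame (`demo`), using names and counts that appear in this notebook:

```python
import pandas as pd

# Hypothetical sample; the names, years, and counts are copied from rows shown in this notebook.
demo = pd.DataFrame({
    'Name': ['Abby', 'Adam', 'Ida', 'Abbie'],
    'Year': [1880, 1880, 1881, 1880],
    'Count': [6, 104, 1439, 71],
})

# Year ascending, then Name descending within each year.
mixed = demo.sort_values(by=['Year', 'Name'], ascending=[True, False])

print(mixed['Name'].tolist())  # ['Adam', 'Abby', 'Abbie', 'Ida']
```

Passing a single `True` or `False`, as in the cells above, applies the same direction to every column in `by`.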


Below is one way to find the row with the highest `Count` in the entire data set, that is, the most popular name by raw count in a single year.

In [None]:
df.sort_values(by = ['Count'], ascending = True).tail(1)
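
Sorting the whole frame just to read off one row works, but `nlargest` states the intent more directly. The sketch below compares the two on a hypothetical three-row frame (`demo`) whose values are copied from rows shown in this notebook.

```python
import pandas as pd

# Hypothetical three-row frame; the values are copied from rows shown in this notebook.
demo = pd.DataFrame({
    'Name': ['Elizabeth', 'Charles', 'Ida'],
    'Year': [1880, 1880, 1881],
    'Count': [1939, 5348, 1439],
})

# Sort-and-tail approach, as in the cell above.
by_sort = demo.sort_values(by=['Count'], ascending=True).tail(1)

# nlargest returns the n rows with the largest values in the given column.
by_nlargest = demo.nlargest(1, 'Count')

print(by_nlargest['Name'].iloc[0])  # Charles
```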

But what if we were interested in something a bit more specific? Perhaps, the most popular name during a given year.

**Activity 7**: *Find the most popular female name in the year 1881.*

In [33]:
# Code for activity 7 goes here 
# -----------------------------

df_1881 = df[(df['Year'] == 1881) & (df['Gender'] == 'F')] 
df_1881.sort_values(by = ['Count'], ascending=True).tail(1)


|   | Name | Year | Gender | Count |
|---|---|---|---|---|
| 674 | Ida | 1881 | F | 1439 |
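
The same question can be asked for every year at once. One possible approach, sketched here on a hypothetical four-row frame (`demo`) rather than the full data set, groups by `Year` and uses `idxmax` to pick the row with the largest `Count` in each group.

```python
import pandas as pd

# Hypothetical sample; the values are copied from rows shown in this notebook.
demo = pd.DataFrame({
    'Name': ['Elizabeth', 'Alice', 'Ida', 'Charles'],
    'Year': [1880, 1880, 1881, 1880],
    'Gender': ['F', 'F', 'F', 'M'],
    'Count': [1939, 1414, 1439, 5348],
})

demo_f = demo[demo['Gender'] == 'F']

# idxmax returns the index label of the largest Count within each year.
top_idx = demo_f.groupby('Year')['Count'].idxmax()

# Use those labels to pull the matching rows.
top_per_year = demo_f.loc[top_idx, ['Year', 'Name', 'Count']]

print(top_per_year['Name'].tolist())  # ['Elizabeth', 'Ida']
```

If two names tie for the largest count in a year, `idxmax` keeps only the first one it encounters.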


This final practice exercise is going to be a challenge. Challenge exercises are meant to encourage you to expand on what you have already learned and to search for answers that we may not have explicitly gone over.

Imagine that we wanted to find only the names starting with a certain letter.

**Activity 8**: *Create a subset of names from the data set that start with the letter "E". Name this data frame `starts_with_e`.*

In [38]:
# Code for activity 8 goes here 
# -----------------------------

# The code below takes the first character of each element in the 'Name' column with `.str[0]` and checks whether it is 'E'.
# Since names in the df start with a capital letter, we don't have to worry about case here, but it is something
# to keep in mind for other data sets.

starts_with_e = df[df['Name'].str[0] == 'E']

starts_with_e.head(20)

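|   | Name | Year | Gender | Count |
|---|---|---|---|---|
| 0 | Elizabeth | 1880 | F | 1939 |
| 3 | Ella | 1880 | F | 1156 |
| 14 | Ethel | 1880 | F | 633 |
| 16 | Eva | 1880 | F | 614 |
| 23 | Effie | 1880 | F | 406 |
| 27 | Etta | 1880 | F | 323 |
| 29 | Elsie | 1880 | F | 301 |
| 33 | Eliza | 1880 | F | 252 |
| 45 | Estella | 1880 | F | 162 |
| 51 | Evelyn | 1880 | F | 122 |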


In [39]:
starts_with_e.tail(20)

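|   | Name | Year | Gender | Count |
|---|---|---|---|---|
| 608418 | Eker | 2014 | M | 5 |
| 608419 | Elad | 2014 | M | 5 |
| 608420 | Eliejah | 2014 | M | 5 |
| 608421 | Elikai | 2014 | M | 5 |
| 608422 | Elisee | 2014 | M | 5 |
| 608423 | Elyjiah | 2014 | M | 5 |
| 608424 | Emar | 2014 | M | 5 |
| 608425 | Emiel | 2014 | M | 5 |
| 608426 | Ephriam | 2014 | M | 5 |
| 608427 | Equan | 2014 | M | 5 |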
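
If the names were not consistently capitalized, the filter would need to handle case. A minimal sketch using `str.startswith` on a hypothetical four-name frame (`demo`); the lowercase entry `'ella'` is made up to show the difference.

```python
import pandas as pd

# Hypothetical sample; 'ella' is deliberately lowercase to show the effect of case.
demo = pd.DataFrame({'Name': ['Elizabeth', 'ella', 'Alice', 'Eva']})

# Case-sensitive: only names beginning with an uppercase "E".
exact = demo[demo['Name'].str.startswith('E')]

# Case-insensitive: lowercase the names first, then test the prefix.
any_case = demo[demo['Name'].str.lower().str.startswith('e')]

print(exact['Name'].tolist())     # ['Elizabeth', 'Eva']
print(any_case['Name'].tolist())  # ['Elizabeth', 'ella', 'Eva']
```

`str.startswith` also works for prefixes longer than one character, which `.str[0]` does not cover.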


# Save your notebook, then `File > Close and Halt`