# Data Types for Data Science in Python

Have you got your basic Python programming chops down for Data Science but are yearning for more? Then this is the course for you. In it, you'll consolidate and practice your knowledge of lists, dictionaries, tuples, sets, and dates and times. You'll see their relevance in working with lots of real data and how to use several of them in concert to solve multistep problems, including an extended case study using Chicago metropolitan area transit data. You'll also learn how to use many of the objects in the Python collections module, which will allow you to store and manipulate your data for a variety of Data Science purposes. After taking this course, you'll be ready to tackle many Data Science challenges Pythonically.


## Fundamental data types
This chapter introduces you to three fundamental Python data types: lists, sets, and tuples. These data containers are critical because they provide the basis for storing and looping over collections of data (lists and tuples keep their items in order, while sets do not). To make things interesting, you'll apply what you learn about these types to answer questions about the New York Baby Names dataset!

### Manipulating lists for fun and profit
You may be familiar with adding individual data elements to a list by using the .append() method. However, if you want to add all the elements of another iterable (such as a list, set, or tuple) to a list, you can use the .extend() method on the list.

You can also use the .index() method to find the position of an item in a list. You can then use that position to remove the item with the .pop() method.

In this exercise, you'll practice using all these methods!

In [2]:
# Create a list containing the names: baby_names
baby_names = ['Ximena', 'Aliza', 'Ayden', 'Calvin']

# Use the .extend() method on baby_names to add 'Rowen' 
# and 'Sandeep' and print the list.
baby_names.extend(['Rowen','Sandeep'])

# Print baby_names
print(baby_names)

# Find the position of 'Aliza' using .index(): position
position = baby_names.index('Aliza')

# Remove 'Aliza' from baby_names using .pop()
baby_names.pop(position)

# Print baby_names
print(baby_names)

['Ximena', 'Aliza', 'Ayden', 'Calvin', 'Rowen', 'Sandeep']
['Ximena', 'Ayden', 'Calvin', 'Rowen', 'Sandeep']
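
As a supplement to the exercise above, the following minimal sketch contrasts .append() with .extend() and shows a guarded use of .index() with .pop(). The variable names and sample values are made up for illustration and are not part of the baby names dataset.

```python
# Hypothetical sample data, used only for illustration
names = ['Ximena', 'Aliza']

# .append() adds its argument as ONE element, even if it is a list
appended = names.copy()
appended.append(['Rowen', 'Sandeep'])
print(appended)  # ['Ximena', 'Aliza', ['Rowen', 'Sandeep']]

# .extend() adds each element of any iterable (here, a tuple)
extended = names.copy()
extended.extend(('Rowen', 'Sandeep'))
print(extended)  # ['Ximena', 'Aliza', 'Rowen', 'Sandeep']

# .index() raises ValueError when the item is missing, so check first
removed = None
if 'Aliza' in extended:
    removed = extended.pop(extended.index('Aliza'))
print(removed)   # Aliza
print(extended)  # ['Ximena', 'Rowen', 'Sandeep']
```

Note that extending a list with a set works too, but the order in which the set's items are added is not guaranteed.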


In [3]:
girl_names = ['JADA',
 'Emily',
 'Ava',
 'SERENITY',
 'Claire',
 'SOPHIA',
 'Sarah',
 'ASHLEY',
 'CHAYA',
 'ABIGAIL',
 'Zoe',
 'LEAH',
 'HAILEY',
 'AVA',
 'Olivia',
 'EMMA',
 'CHLOE',
 'Sophia',
 'AALIYAH',
 'Angela',
 'Camila',
 'Savannah',
 'Serenity',
 'Chloe',
 'Fatoumata',
 'ISABELLA',
 'MIA',
 'FIONA',
 'Skylar',
 'Ashley',
 'Rachel',
 'Sofia',
 'Alina',
 'MADISON',
 'RACHEL',
 'CAMILA',
 'CHANA',
 'TAYLOR',
 'Kayla',
 'Miriam',
 'Leah',
 'Grace',
 'ANGELA',
 'Isabella',
 'Emma',
 'KAYLA',
 'SOFIA',
 'Madison',
 'Aaliyah',
 'Taylor',
 'GENESIS',
 'Esther',
 'MAKAYLA',
 'Victoria',
 'Chaya',
 'Brielle',
 'Anna',
 'Samantha',
 'ESTHER',
 'GRACE',
 'Mariam',
 'Mia',
 'NEVAEH',
 'GABRIELLE',
 'EMILY',
 'London',
 'TIFFANY',
 'Chana',
 'Valentina',
 'OLIVIA',
 'LONDON',
 'MIRIAM',
 'SARAH',
 'ELLA']



print(girl_names)

['JADA', 'Emily', 'Ava', 'SERENITY', 'Claire', 'SOPHIA', 'Sarah', 'ASHLEY', 'CHAYA', 'ABIGAIL', 'Zoe', 'LEAH', 'HAILEY', 'AVA', 'Olivia', 'EMMA', 'CHLOE', 'Sophia', 'AALIYAH', 'Angela', 'Camila', 'Savannah', 'Serenity', 'Chloe', 'Fatoumata', 'ISABELLA', 'MIA', 'FIONA', 'Skylar', 'Ashley', 'Rachel', 'Sofia', 'Alina', 'MADISON', 'RACHEL', 'CAMILA', 'CHANA', 'TAYLOR', 'Kayla', 'Miriam', 'Leah', 'Grace', 'ANGELA', 'Isabella', 'Emma', 'KAYLA', 'SOFIA', 'Madison', 'Aaliyah', 'Taylor', 'GENESIS', 'Esther', 'MAKAYLA', 'Victoria', 'Chaya', 'Brielle', 'Anna', 'Samantha', 'ESTHER', 'GRACE', 'Mariam', 'Mia', 'NEVAEH', 'GABRIELLE', 'EMILY', 'London', 'TIFFANY', 'Chana', 'Valentina', 'OLIVIA', 'LONDON', 'MIRIAM', 'SARAH', 'ELLA']


In [5]:
boy_names = ['JOSIAH',
 'ETHAN',
 'David',
 'Jayden',
 'MASON',
 'RYAN',
 'CHRISTIAN',
 'ISAIAH',
 'JAYDEN',
 'Michael',
 'NOAH',
 'SAMUEL',
 'SEBASTIAN',
 'Noah',
 'Dylan',
 'LUCAS',
 'JOSHUA',
 'ANGEL',
 'Jacob',
 'Matthew',
 'Josiah',
 'JACOB',
 'Muhammad',
 'ALEXANDER',
 'Jason',
 'Ethan',
 'DANIEL',
 'Joseph',
 'AIDEN',
 'Moshe',
 'Jeremiah',
 'William',
 'Alexander',
 'Sebastian',
 'ERIC',
 'MOSHE',
 'Jack',
 'Eric',
 'MUHAMMAD',
 'Lucas',
 'BENJAMIN',
 'Aiden',
 'Ryan',
 'Liam',
 'JASON',
 'KEVIN',
 'Elijah',
 'Angel',
 'JAMES',
 'Daniel',
 'Samuel',
 'Amir',
 'Mason',
 'Joshua',
 'ANTHONY',
 'JOSEPH',
 'Benjamin',
 'JUSTIN',
 'JEREMIAH',
 'MATTHEW',
 'Carter',
 'James',
 'TYLER',
 'DAVID',
 'JACK',
 'ELIJAH',
 'MICHAEL',
 'CHRISTOPHER']

print(boy_names)

['JOSIAH', 'ETHAN', 'David', 'Jayden', 'MASON', 'RYAN', 'CHRISTIAN', 'ISAIAH', 'JAYDEN', 'Michael', 'NOAH', 'SAMUEL', 'SEBASTIAN', 'Noah', 'Dylan', 'LUCAS', 'JOSHUA', 'ANGEL', 'Jacob', 'Matthew', 'Josiah', 'JACOB', 'Muhammad', 'ALEXANDER', 'Jason', 'Ethan', 'DANIEL', 'Joseph', 'AIDEN', 'Moshe', 'Jeremiah', 'William', 'Alexander', 'Sebastian', 'ERIC', 'MOSHE', 'Jack', 'Eric', 'MUHAMMAD', 'Lucas', 'BENJAMIN', 'Aiden', 'Ryan', 'Liam', 'JASON', 'KEVIN', 'Elijah', 'Angel', 'JAMES', 'Daniel', 'Samuel', 'Amir', 'Mason', 'Joshua', 'ANTHONY', 'JOSEPH', 'Benjamin', 'JUSTIN', 'JEREMIAH', 'MATTHEW', 'Carter', 'James', 'TYLER', 'DAVID', 'JACK', 'ELIJAH', 'MICHAEL', 'CHRISTOPHER']


### Using and unpacking tuples
Like lists, tuples hold several items in order, but they are immutable: once created, they cannot be modified in any way. It is very common for tuples to be used to represent data from a database. If you have a tuple like ('chocolate chip cookies', 15) and you want to access each part of the data, you can use an index just like a list. However, you can also "unpack" the tuple into multiple variables. For example, type, count = ('chocolate chip cookies', 15) sets type to 'chocolate chip cookies' and count to 15.
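
The indexing, unpacking, and immutability described above can be sketched as follows. The variable names are illustrative; cookie_type is used in place of type so that the sketch does not shadow Python's built-in type() function.

```python
# Hypothetical record, e.g. one row returned from a database query
cookie = ('chocolate chip cookies', 15)

# Access by index, just like a list
print(cookie[0])  # chocolate chip cookies

# Unpack into two variables in one statement
cookie_type, count = cookie
print(cookie_type, count)  # chocolate chip cookies 15

# Tuples are immutable: item assignment raises TypeError
try:
    cookie[1] = 20
except TypeError as exc:
    message = str(exc)
    print(message)
```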

Often you'll want to pair up multiple sequences. The zip() function does just that. In Python 3, it returns an iterator of tuples, each containing one element from each iterable passed into zip(), and it stops as soon as the shortest iterable is exhausted.

When looping over a list, you can also track your position in the list by using the enumerate() function. On each iteration, it yields a tuple containing the index of the current item and the item itself.
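
Here is a small sketch of zip() and enumerate() working together. The two short lists are made-up sample data, and the start=1 argument is an optional extra that makes the ranks begin at 1 instead of 0.

```python
# Hypothetical sample lists of different lengths
girls = ['Jada', 'Emily', 'Ava']
boys = ['Josiah', 'Ethan']

# zip() returns a lazy iterator; list() materializes it.
# Pairing stops at the shortest input, so 'Ava' is dropped.
pairs = list(zip(girls, boys))
print(pairs)  # [('Jada', 'Josiah'), ('Emily', 'Ethan')]

# enumerate() yields (index, item); the item can be unpacked in place
ranked = []
for rank, (girl, boy) in enumerate(zip(girls, boys), start=1):
    ranked.append('Rank {}: {} and {}'.format(rank, girl, boy))
print(ranked)  # ['Rank 1: Jada and Josiah', 'Rank 2: Emily and Ethan']
```

Because zip() returns an iterator, it can only be consumed once; wrap it in list() if you need to loop over the pairs more than once.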

You'll practice using the enumerate() and zip() functions in this exercise, in which your job is to pair up the most common boy and girl names. Two lists - girl_names and boy_names - have been pre-loaded into your workspace.

In [6]:
# Pair up the girl and boy names: pairs
pairs = zip(girl_names, boy_names)

# Iterate over pairs
for idx, pair in enumerate(pairs):
    # Unpack pair: girl_name, boy_name
    girl_name, boy_name = pair
    # Print the rank and names associated with each rank
    print('Rank {}: {} and {}'.format(idx, girl_name, boy_name))

Rank 0: JADA and JOSIAH
Rank 1: Emily and ETHAN
Rank 2: Ava and David
Rank 3: SERENITY and Jayden
Rank 4: Claire and MASON
Rank 5: SOPHIA and RYAN
Rank 6: Sarah and CHRISTIAN
Rank 7: ASHLEY and ISAIAH
Rank 8: CHAYA and JAYDEN
Rank 9: ABIGAIL and Michael
Rank 10: Zoe and NOAH
Rank 11: LEAH and SAMUEL
Rank 12: HAILEY and SEBASTIAN
Rank 13: AVA and Noah
Rank 14: Olivia and Dylan
Rank 15: EMMA and LUCAS
Rank 16: CHLOE and JOSHUA
Rank 17: Sophia and ANGEL
Rank 18: AALIYAH and Jacob
Rank 19: Angela and Matthew
Rank 20: Camila and Josiah
Rank 21: Savannah and JACOB
Rank 22: Serenity and Muhammad
Rank 23: Chloe and ALEXANDER
Rank 24: Fatoumata and Jason
Rank 25: ISABELLA and Ethan
Rank 26: MIA and DANIEL
Rank 27: FIONA and Joseph
Rank 28: Skylar and AIDEN
Rank 29: Ashley and Moshe
Rank 30: Rachel and Jeremiah
Rank 31: Sofia and William
Rank 32: Alina and Alexander
Rank 33: MADISON and Sebastian
Rank 34: RACHEL and ERIC
Rank 35: CAMILA and MOSHE
Rank 36: CHANA and Jack
Rank 37: TAYLOR and Eric


### Making tuples by accident
Tuples are very powerful and useful, and it's super easy to make one by accident. All you have to do is follow the value in an assignment with a trailing comma. This causes errors later, when you use the variable expecting it to be a string or a number.

You can verify the data type of a variable with the type() function. In this exercise, you'll see for yourself how easy it is to make a tuple by accident.

In [7]:
# Create the normal variable: normal
normal = 'simple'

# Create the mistaken variable: error
error = 'trailing comma',

# Print the types of the variables
print(type(normal))
print(type(error))

<class 'str'>
<class 'tuple'>
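
To see how such an accidental tuple causes trouble later, consider this minimal sketch. The variable names and the string concatenation are illustrative assumptions, not part of the exercise.

```python
# A trailing comma turns the right-hand side into a one-element tuple
label = 'simple',
print(type(label))  # <class 'tuple'>
print(len(label))   # 1

# The mistake surfaces later, when the value is used as a string
try:
    greeting = 'very ' + label
except TypeError:
    # Recover the intended string by taking the tuple's only element
    greeting = 'very ' + label[0]
print(greeting)  # very simple

# Parentheses alone do not make a tuple; the comma does
print(type(('simple')))   # <class 'str'>
print(type(('simple',)))  # <class 'tuple'>
```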


### Finding all the data and the overlapping data between sets
Sets have several methods for combining and comparing them, all based on mathematical set theory. The .union() method returns a set of all the items found in the set you used the method on plus any sets passed as arguments to the method. You can also look for overlapping data in sets by using the .intersection() method on a set and passing another set as an argument. It will return an empty set if nothing matches.
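
The behavior of .union() and .intersection() can be sketched on a few small sets. The set names and their contents are made up for illustration.

```python
# Hypothetical sample sets
names_a = {'Emma', 'Noah', 'Liam'}
names_b = {'Liam', 'Olivia'}
names_c = {'Zoe'}

# .union() accepts one or more sets and returns every distinct item
combined = names_a.union(names_b, names_c)
print(sorted(combined))  # ['Emma', 'Liam', 'Noah', 'Olivia', 'Zoe']

# .intersection() keeps only the items present in both sets
shared = names_a.intersection(names_b)
print(shared)  # {'Liam'}

# With nothing in common, the result is an empty set
nothing_shared = names_a.intersection(names_c)
print(nothing_shared)  # set()
```

Neither method modifies the set it is called on; each returns a new set.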

Your job in this exercise is to find the union and intersection in the names from 2011 and 2014. For this purpose, two sets have been pre-loaded into your workspace: baby_names_2011 and baby_names_2014.

One quirk in the baby names dataset is that names in 2011 and 2012 are all in upper case, while names in 2013 and 2014 are in title case (where the first letter of each name is capitalized). Consequently, if you were to compare the 2011 and 2014 data in this form, you would find no overlapping names between the two years! To remedy this, we converted the names in 2011 to title case using Python's .title() method.

Real-world data can often come with quirks like this - it's important to catch them to ensure your results are meaningful.
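
The effect of the case mismatch, and of the .title() fix, can be reproduced on a tiny made-up sample. The sets below are hypothetical stand-ins for the real data.

```python
# Hypothetical stand-ins: upper case for 2011, title case for 2014
upper_2011 = {'EMMA', 'NOAH'}
title_2014 = {'Emma', 'Liam'}

# Without normalization, no names appear to overlap
raw_overlap = upper_2011.intersection(title_2014)
print(raw_overlap)  # set()

# Convert each 2011 name to title case with a set comprehension
normalized_2011 = {name.title() for name in upper_2011}
fixed_overlap = normalized_2011.intersection(title_2014)
print(fixed_overlap)  # {'Emma'}

# Caveat: .title() also capitalizes the letter after an apostrophe
print("AMAR'E".title())  # Amar'E
```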

In [36]:
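import pandas as pd

# records is assumed to have been loaded earlier in the notebook
records = pd.DataFrame(records)

# Select the rows for each birth year (column names as used in this notebook)
records2011 = records[records['BRITH_YEAR'] == 2011]
records2014 = records[records['BRITH_YEAR'] == 2014]

# Build the sets of names, converting the upper-case 2011 names to title case
baby_names_2011 = set(records2011['NAME'].str.title())
baby_names_2014 = set(records2014['NAME'])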

# Combine all the names in baby_names_2011 and baby_names_2014 by computing their 
# union. Store the result as all_names.
all_names = baby_names_2011.union(baby_names_2014)

# Print the number of names that occur in all_names. You can use the len() function
# to compute the number of names in all_names.
print(len(all_names))

# Find the intersection: overlapping_names
overlapping_names = baby_names_2011.intersection(baby_names_2014)

# Print the count of names in overlapping_names
print(len(overlapping_names))

1461
986


### Determining set differences
Another way of comparing sets is to use the .difference() method. It returns all the items found in one set but not another. It's important to remember that the set you call the method on will be the one from which the items are returned. Unlike tuples, sets can be modified: you can add items to a set with the .add() method. Adding an item that already exists in the set has no effect.
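
A short sketch shows why the direction of .difference() matters and how .add() treats duplicates. The set names and contents are made up for illustration.

```python
# Hypothetical sample sets
demo_2011 = {'Emma', 'Noah', 'Ingrid'}
demo_2014 = {'Emma', 'Noah', 'Liam'}

# Items in the 2011 set that are missing from the 2014 set
only_2011 = demo_2011.difference(demo_2014)
print(only_2011)  # {'Ingrid'}

# Swapping the sets gives a different answer
only_2014 = demo_2014.difference(demo_2011)
print(only_2014)  # {'Liam'}

# .add() ignores items that are already present
demo_2014.add('Emma')  # already in the set, so nothing changes
demo_2014.add('Zoe')   # new item, so the set grows by one
print(len(demo_2014))  # 4
```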

In this exercise, you'll explore which names were common in 2011 but are no longer common in 2014. The set baby_names_2014 has been pre-loaded into your workspace. As in the previous exercise, the names have been converted to title case to ensure a proper comparison.

In [39]:
# Find the names in 2011 that are not in 2014: differences
differences = baby_names_2011.difference(baby_names_2014)

# Print the differences and how many there are
print(differences)
print(len(differences))

{'Ingrid', 'Diya', 'Cristina', 'Elianna', 'Roselyn', 'Anika', 'Johnathan', 'Keily', 'Hayley', 'Yu', 'Cody', 'Tzivia', 'Yair', 'Sidney', 'Yachet', 'Yida', 'Sekou', 'Kacper', 'Shneur', 'Stacy', 'Leela', 'Julien', 'Amrom', 'Kelvin', 'Makai', 'Dereck', 'Gittel', 'Nathalia', 'Rihanna', 'Nana', 'Ariela', 'Jaime', 'Gustavo', 'Roger', 'Jencarlos', 'Derrick', 'Carmine', 'Idy', 'Geraldine', 'Sade', "Amar'E", 'Troy', 'Princess', 'Leonel', 'Xin', 'Raquel', 'Zyaire', 'Maurice', 'Byron', 'Maximo', 'Jeancarlos', 'Jacky', 'Jamel', 'Annabel', 'Marisol', 'Amani', 'Yehudah', 'Jeremias', 'Tamia', 'Zahra', 'Jermaine', 'Ahron', 'Luz', 'Yerik', 'Keith', 'Christy', 'Paola', 'Giovanny', 'Alyson', 'Johann', 'Alec', 'Shevy', 'Milena', 'Fernanda', 'Brianny', 'Aditya', 'Perla', 'Shaniya', 'Essence', 'Denise', 'Krystal', 'Augustus', 'Cristopher', 'Michal', 'Kaelyn', 'Damaris', 'Malcolm', 'Ilan', 'Jaylyn', 'Alfredo', 'Lamar', 'Christine', 'Nataly', 'Marquis', 'Yaniel', 'Jaelynn', 'Mckenzie', 'Aldo', 'Jelani', 'Johan