# Fundamental data types in Data Science
>This chapter will introduce you to the fundamental Python data types - lists, sets, and tuples. These data containers are critical as they provide the basis for storing and looping over ordered data. To make things interesting, you'll apply what you learn about these types to answer questions about the New York Baby Names dataset! This is a summary of the lecture "Data Types for Data Science in Python", via DataCamp.

- toc: true 
- badges: true
- comments: true
- author: Chanseok Kang
- categories: [Python, Datacamp, Data_Science]
- image: 

In [5]:
import pandas as pd

## Introduction and lists
- Container sequences
    - Hold other types of data
    - Used for aggregation, sorting, and more
    - Can be mutable (list, set) or immutable (tuple)
    - Iterable
- Lists
    - Hold data in the order it was added
    - Mutable
    - Index
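
A minimal sketch of these list properties. The `names` list is made-up sample data, not taken from the dataset:

```python
# Hypothetical example data, not taken from the baby names dataset
names = ['Ava', 'Liam', 'Zoe']

# Lists keep insertion order, so an index is a stable position
first = names[0]    # first item
last = names[-1]    # negative indexes count from the end

# Lists are mutable: an item can be replaced in place
names[1] = 'Noah'

print(first, last)  # Ava Zoe
print(names)        # ['Ava', 'Noah', 'Zoe']
```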

### Manipulating lists for fun and profit
You may be familiar with adding individual data elements to a list by using the `.append()` method. However, if you want to combine a list with another iterable (list, set, tuple), you can use the `.extend()` method on the list.

You can also use the `.index()` method to find the position of an item in a list. You can then use that position to remove the item with the `.pop()` method.

In [3]:
# Create a list containing the names: baby_names
baby_names = ['Ximena', 'Aliza', 'Ayden', 'Calvin']

# Extend baby_names with 'Rowen' and 'Sandeep'
baby_names.extend(['Rowen', 'Sandeep'])

# Print baby_names
print(baby_names)

# Find the position of 'Aliza': position
position = baby_names.index('Aliza')

# Remove 'Aliza' from baby_names
baby_names.pop(position)

# Print baby_names
print(baby_names)

['Ximena', 'Aliza', 'Ayden', 'Calvin', 'Rowen', 'Sandeep']
['Ximena', 'Ayden', 'Calvin', 'Rowen', 'Sandeep']
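
The difference between `.append()` and `.extend()` is easy to miss, so here is a small sketch on made-up data. It also guards the `.index()` call, since `.index()` raises a `ValueError` when the item is missing:

```python
# Hypothetical example data
names = ['Ximena', 'Ayden']

# .append() adds its argument as ONE element, even if it is a list
appended = names.copy()
appended.append(['Rowen', 'Sandeep'])

# .extend() adds each element of the iterable individually
extended = names.copy()
extended.extend(('Rowen', 'Sandeep'))  # a tuple works as well as a list

print(appended)  # ['Ximena', 'Ayden', ['Rowen', 'Sandeep']]
print(extended)  # ['Ximena', 'Ayden', 'Rowen', 'Sandeep']

# .index() raises ValueError for a missing item, so check membership first
if 'Aliza' in extended:
    extended.pop(extended.index('Aliza'))
```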


### Looping over lists
You can use a `for` loop to iterate through all the items in a list. You can take that a step further with the `sorted()` function, which sorts the data from lowest to highest for numbers and in alphabetical order for strings. Uppercase letters sort before lowercase ones, so mixed-case data is not strictly alphabetical.

The `sorted()` function returns a new list and does not affect the list you passed into the function. You can learn more about `sorted()` in the [Python documentation](https://docs.python.org/3/library/functions.html#sorted).
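
As a rough illustration on made-up names, the sketch below shows that `sorted()` leaves the original list untouched and that the default string ordering is case-sensitive. Passing `key=str.lower` is one way to get a case-insensitive sort:

```python
# Hypothetical mixed-case sample
names = ['Dylan', 'CHAVY', 'antonia', 'AALIYAH']

# sorted() returns a new list; `names` keeps its original order
by_code_point = sorted(names)

# key=str.lower compares the names case-insensitively
case_insensitive = sorted(names, key=str.lower)

# reverse=True sorts from highest to lowest
descending = sorted(names, key=str.lower, reverse=True)

print(by_code_point)     # ['AALIYAH', 'CHAVY', 'Dylan', 'antonia']
print(case_insensitive)  # ['AALIYAH', 'antonia', 'CHAVY', 'Dylan']
print(descending)        # ['Dylan', 'CHAVY', 'antonia', 'AALIYAH']
```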

A list of lists, `records`, has been pre-loaded. Each row has a shape like this:
```python
['2011', 'FEMALE', 'HISPANIC', 'GERALDINE', '13', '75']
```

The name of the baby (`'GERALDINE'`) is the fourth entry of this list. Your job in this exercise is to loop over this list of lists and append the name of each baby to a new list called `baby_names`.

In [8]:
records = pd.read_csv('./dataset/baby_names.csv').values.tolist()

In [12]:
# Create the empty list: baby_names
baby_names = []

# Loop over records
for row in records:
    # Add the name to the list
    baby_names.append(row[3])
    
# Loop over the names in sorted order
for i, name in enumerate(sorted(baby_names)):
    # Print every 1000th name to keep the output short
    if i % 1000 == 0:
        print(name)

AALIYAH
ANGELIQUE
Antonia
CHAVY
Dylan
FRANCISCO
ISABELLA
JOSEPHINE
KRYSTAL
Luciana
Margot
PENELOPE
SEBASTIAN
TRISTAN


## Meet the Tuples
- Tuple
    - Hold data in order
    - Index
    - Immutable
    - Pairing
    - Unpackable

### Using and unpacking tuples
Tuples are made of several items just like a list, but they cannot be modified in any way. It is very common for tuples to be used to represent data from a database. If you have a tuple like `('chocolate chip cookies', 15)` and you want to access each part of the data, you can use an index just like a list. However, you can also "unpack" the tuple into multiple variables, such as `type, count = ('chocolate chip cookies', 15)`, which sets `type` to `'chocolate chip cookies'` and `count` to `15`.
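
A short sketch of indexing, unpacking, and immutability with the cookie tuple. The variable name `cookie_type` is an assumption made here to avoid shadowing Python's built-in `type()`:

```python
# The tuple from the text above
cookie = ('chocolate chip cookies', 15)

# Access by index, just like a list
name_by_index = cookie[0]

# Unpack into two variables in one statement
# (cookie_type is used instead of `type` to avoid shadowing the built-in)
cookie_type, count = cookie

# Tuples are immutable: item assignment raises TypeError
try:
    cookie[1] = 20
    modified = True
except TypeError:
    modified = False

print(cookie_type, count, modified)  # chocolate chip cookies 15 False
```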

Often you'll want to pair up multiple array data types. The `zip()` function does just that. In Python 3 it returns an iterator of tuples, each containing one element from each list passed into `zip()`, and it stops at the end of the shortest input.

When looping over a list, you can also track your position in the list by using the `enumerate()` function. It yields a tuple containing the index of the current item and the item itself.
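
The behavior of both functions can be sketched on two made-up lists of unequal length. Using `start=1` is an optional choice here to get 1-based ranks:

```python
# Hypothetical short lists of unequal length
girls = ['Emma', 'Olivia', 'Mia']
boys = ['Liam', 'Noah']

# zip() returns a lazy iterator; list() materializes the pairs.
# Pairing stops at the shortest input, so 'Mia' is dropped.
pairs = list(zip(girls, boys))

# enumerate() yields (index, item) tuples; start=1 gives 1-based ranks
ranked = [(rank, girl, boy) for rank, (girl, boy) in enumerate(pairs, start=1)]

for rank, girl, boy in ranked:
    print('Rank {}: {} and {}'.format(rank, girl, boy))
```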

You'll practice using the `enumerate()` and `zip()` functions in this exercise, in which your job is to pair up the most common boy and girl names. 

In [13]:
#hide
girl_names = ['JADA',
 'Emily',
 'Ava',
 'SERENITY',
 'Claire',
 'SOPHIA',
 'Sarah',
 'ASHLEY',
 'CHAYA',
 'ABIGAIL',
 'Zoe',
 'LEAH',
 'HAILEY',
 'AVA',
 'Olivia',
 'EMMA',
 'CHLOE',
 'Sophia',
 'AALIYAH',
 'Angela',
 'Camila',
 'Savannah',
 'Serenity',
 'Chloe',
 'Fatoumata',
 'ISABELLA',
 'MIA',
 'FIONA',
 'Skylar',
 'Ashley',
 'Rachel',
 'Sofia',
 'Alina',
 'MADISON',
 'RACHEL',
 'CAMILA',
 'CHANA',
 'TAYLOR',
 'Kayla',
 'Miriam',
 'Leah',
 'Grace',
 'ANGELA',
 'Isabella',
 'Emma',
 'KAYLA',
 'SOFIA',
 'Madison',
 'Aaliyah',
 'Taylor',
 'GENESIS',
 'Esther',
 'MAKAYLA',
 'Victoria',
 'Chaya',
 'Brielle',
 'Anna',
 'Samantha',
 'ESTHER',
 'GRACE',
 'Mariam',
 'Mia',
 'NEVAEH',
 'GABRIELLE',
 'EMILY',
 'London',
 'TIFFANY',
 'Chana',
 'Valentina',
 'OLIVIA',
 'LONDON',
 'MIRIAM',
 'SARAH',
 'ELLA']

In [14]:
boy_names = ['JOSIAH',
 'ETHAN',
 'David',
 'Jayden',
 'MASON',
 'RYAN',
 'CHRISTIAN',
 'ISAIAH',
 'JAYDEN',
 'Michael',
 'NOAH',
 'SAMUEL',
 'SEBASTIAN',
 'Noah',
 'Dylan',
 'LUCAS',
 'JOSHUA',
 'ANGEL',
 'Jacob',
 'Matthew',
 'Josiah',
 'JACOB',
 'Muhammad',
 'ALEXANDER',
 'Jason',
 'Ethan',
 'DANIEL',
 'Joseph',
 'AIDEN',
 'Moshe',
 'Jeremiah',
 'William',
 'Alexander',
 'Sebastian',
 'ERIC',
 'MOSHE',
 'Jack',
 'Eric',
 'MUHAMMAD',
 'Lucas',
 'BENJAMIN',
 'Aiden',
 'Ryan',
 'Liam',
 'JASON',
 'KEVIN',
 'Elijah',
 'Angel',
 'JAMES',
 'Daniel',
 'Samuel',
 'Amir',
 'Mason',
 'Joshua',
 'ANTHONY',
 'JOSEPH',
 'Benjamin',
 'JUSTIN',
 'JEREMIAH',
 'MATTHEW',
 'Carter',
 'James',
 'TYLER',
 'DAVID',
 'JACK',
 'ELIJAH',
 'MICHAEL',
 'CHRISTOPHER']

In [15]:
# Pair up the girl and boy names: pairs
pairs = zip(girl_names, boy_names)

# Iterate over pairs
for idx, pair in enumerate(pairs):
    # Unpack pair: girl_name, boy_name
    girl_name, boy_name = pair
    # Print the rank and names associated with each rank
    print('Rank {}: {} and {}'.format(idx, girl_name, boy_name))

Rank 0: JADA and JOSIAH
Rank 1: Emily and ETHAN
Rank 2: Ava and David
Rank 3: SERENITY and Jayden
Rank 4: Claire and MASON
Rank 5: SOPHIA and RYAN
Rank 6: Sarah and CHRISTIAN
Rank 7: ASHLEY and ISAIAH
Rank 8: CHAYA and JAYDEN
Rank 9: ABIGAIL and Michael
Rank 10: Zoe and NOAH
Rank 11: LEAH and SAMUEL
Rank 12: HAILEY and SEBASTIAN
Rank 13: AVA and Noah
Rank 14: Olivia and Dylan
Rank 15: EMMA and LUCAS
Rank 16: CHLOE and JOSHUA
Rank 17: Sophia and ANGEL
Rank 18: AALIYAH and Jacob
Rank 19: Angela and Matthew
Rank 20: Camila and Josiah
Rank 21: Savannah and JACOB
Rank 22: Serenity and Muhammad
Rank 23: Chloe and ALEXANDER
Rank 24: Fatoumata and Jason
Rank 25: ISABELLA and Ethan
Rank 26: MIA and DANIEL
Rank 27: FIONA and Joseph
Rank 28: Skylar and AIDEN
Rank 29: Ashley and Moshe
Rank 30: Rachel and Jeremiah
Rank 31: Sofia and William
Rank 32: Alina and Alexander
Rank 33: MADISON and Sebastian
Rank 34: RACHEL and ERIC
Rank 35: CAMILA and MOSHE
Rank 36: CHANA and Jack
Rank 37: TAYLOR and Eric


### Making tuples by accident
Tuples are very powerful and useful, and it's super easy to make one by accident. All you have to do is create a variable and follow the assignment with a comma. This causes errors later, when you use the variable expecting it to be a string or a number.

You can verify the data type of a variable with the `type()` function. In this exercise, you'll see for yourself how easy it is to make a tuple by accident.

In [17]:
# Create the normal variable: normal
normal = 'simple'

# Create the mistaken variable: error
error = 'trailing comma',

# Print the types of the variables
print(type(normal))
print(type(error))

<class 'str'>
<class 'tuple'>
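
To show where such a mistake typically surfaces, here is a small sketch that tries to use the accidental tuple as a string. The concatenation is only an illustrative operation:

```python
# A trailing comma turns the value into a one-element tuple
error = 'trailing comma',

# Using it as a string fails: a tuple cannot be concatenated with a str
failure = None
try:
    message = error + '!'
except TypeError as exc:
    message = None
    failure = type(exc).__name__

# The string itself is still available as the first element
recovered = error[0]

# Parentheses alone do not make a tuple; the comma does
not_a_tuple = ('simple')
one_tuple = ('simple',)

print(failure)                             # TypeError
print(type(not_a_tuple), type(one_tuple))  # <class 'str'> <class 'tuple'>
```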


## Sets for unordered and unique data
- Set
    - Unique
    - Unordered
    - Mutable
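
These three properties can be sketched on a made-up list with repeated names:

```python
# Hypothetical list with repeated names
names = ['Ava', 'Liam', 'Ava', 'Zoe', 'Liam']

# Building a set drops the duplicates
unique_names = set(names)

# Sets are mutable: .add() inserts, .discard() removes (no error if absent)
unique_names.add('Noah')
unique_names.discard('Zoe')

# Sets have no positions, so sort when a stable order is needed
print(sorted(unique_names))  # ['Ava', 'Liam', 'Noah']
```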


### Finding all the data and the overlapping data between sets
Sets have several methods to combine, compare, and study them, all based on mathematical set theory. The `.union()` method returns a set of all the names found in the set you used the method on plus any sets passed as arguments to the method. You can also look for overlapping data in sets by using the `.intersection()` method on a set and passing another set as an argument. It will return an empty set if nothing matches.

Your job in this exercise is to find the union and intersection in the names from 2011 and 2014. For this purpose, two sets have been pre-loaded into your workspace: `baby_names_2011` and `baby_names_2014`.

One quirk in the baby names dataset is that names in 2011 and 2012 are all in upper case, while names in 2013 and 2014 are in title case (where the first letter of each name is capitalized). Consequently, if you were to compare the 2011 and 2014 data in this form, you would find no overlapping names between the two years! To remedy this, we converted the names in 2011 to title case using Python's `.title()` method.

Real-world data can often come with quirks like this - it's important to catch them to ensure your results are meaningful.

In [28]:
baby_names = pd.read_csv('./dataset/baby_names.csv')
baby_names_2011 = set(baby_names[baby_names['BIRTH_YEAR'] == 2011]['NAME'].str.title())
baby_names_2014 = set(baby_names[baby_names['BIRTH_YEAR'] == 2014]['NAME'])

In [29]:
# Find the union: all_names
all_names = baby_names_2011.union(baby_names_2014)

# Print the count of names in all_names
print(len(all_names))

# Find the intersection: overlapping_names
overlapping_names = baby_names_2011.intersection(baby_names_2014)

# Print the count of names in overlapping_names
print(len(overlapping_names))

1461
986
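
The effect of the casing quirk can be reproduced on a tiny scale. The two sets below are made-up samples, and the set comprehension is just one possible way to normalize the names:

```python
# Hypothetical samples mimicking the dataset's casing quirk
names_2011 = {'EMMA', 'JADA', 'SOPHIA'}   # upper case
names_2014 = {'Emma', 'Sophia', 'Zoe'}    # title case

# Without normalizing, nothing overlaps
raw_overlap = names_2011.intersection(names_2014)

# Normalize the 2011 names to title case with a set comprehension
names_2011_title = {name.title() for name in names_2011}

all_names = names_2011_title.union(names_2014)
overlap = names_2011_title.intersection(names_2014)

print(raw_overlap)        # set()
print(sorted(all_names))  # ['Emma', 'Jada', 'Sophia', 'Zoe']
print(sorted(overlap))    # ['Emma', 'Sophia']
```

The operators `|` and `&` are equivalent to `.union()` and `.intersection()` when both operands are sets.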


### Determining set differences
Another way of comparing sets is to use the `.difference()` method. It returns all the items found in one set but not another. It's important to remember the set you call the method on will be the one from which the items are returned. Unlike tuples, you can `.add()` items to a set. A set will only add items that do not already exist in the set.

In this exercise, you'll explore which names were common in 2011 but are no longer common in 2014.
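
Because `.difference()` is not symmetric, the order of the two sets matters. A minimal sketch on made-up samples, which also shows that `.add()` ignores an item that is already present:

```python
# Hypothetical samples
names_2011 = {'Emma', 'Jada', 'Sophia'}
names_2014 = {'Emma', 'Sophia', 'Zoe'}

# The result comes from the set the method is called on
only_2011 = names_2011.difference(names_2014)
only_2014 = names_2014.difference(names_2011)

# .add() ignores items that are already present
size_before = len(names_2011)
names_2011.add('Emma')
size_after = len(names_2011)

print(only_2011)                # {'Jada'}
print(only_2014)                # {'Zoe'}
print(size_before, size_after)  # 3 3
```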

In [33]:
# Create the empty set: baby_names_2011
baby_names_2011 = set()

# Loop over records and add the names from 2011 to the baby_names_2011 set
for row in records:
    # Check if the first column is '2011'
    if row[0] == '2011':
        baby_names_2011.add(row[3])
        
# Find the difference between 2011 and 2014: differences
differences = baby_names_2011.difference(baby_names_2014)

# Print the differences
print(differences)

set()

The result is empty because `records` was built with `pd.read_csv(...).values.tolist()`, so the year in `row[0]` is the integer `2011` rather than the string `'2011'`. The condition never matches and `baby_names_2011` stays empty. Comparing against the integer `2011`, and converting the 2011 names to title case as in the previous exercise, would give a meaningful difference.
