# Concatenating data
## Combining data
- Data may not always come in one large file
    - A 5-million-row dataset may be broken into 5 separate datasets
    - Easier to store and share
    - May have new data for each day
- Important to be able to combine then clean, or vice versa

## Combining rows of data
The dataset you'll be working with here relates to [NYC Uber data][1]. The original dataset has all the originating Uber pickup locations by time and latitude and longitude. For didactic purposes, you'll be working with a very small portion of the actual data.

There are three DataFrames: `uber1`, which contains data for April 2014, `uber2`, which contains data for May 2014, and `uber3`, which contains data for June 2014. Your job in this exercise is to concatenate these DataFrames together such that the resulting DataFrame has the data for all three months.

[1]:http://data.beta.nyc/dataset/uber-trip-data-foiled-apr-sep-2014

In [1]:
import pandas as pd

In [2]:
filepath = '../_datasets/'

In [3]:
uber1 = pd.read_csv('nyc_uber_2014_April.csv', index_col=0)
uber2 = pd.read_csv('nyc_uber_2014_May.csv', index_col=0)
uber3 = pd.read_csv('nyc_uber_2014_June.csv', index_col=0)

In [4]:
print(uber1.shape)
uber1.head()

(99, 4)


Unnamed: 0,Date/Time,Lat,Lon,Base
0,4/1/2014 0:11:00,40.769,-73.9549,B02512
1,4/1/2014 0:17:00,40.7267,-74.0345,B02512
2,4/1/2014 0:21:00,40.7316,-73.9873,B02512
3,4/1/2014 0:28:00,40.7588,-73.9776,B02512
4,4/1/2014 0:33:00,40.7594,-73.9722,B02512


In [5]:
print(uber2.shape)
uber2.head()

(99, 4)


Unnamed: 0,Date/Time,Lat,Lon,Base
0,5/1/2014 0:02:00,40.7521,-73.9914,B02512
1,5/1/2014 0:06:00,40.6965,-73.9715,B02512
2,5/1/2014 0:15:00,40.7464,-73.9838,B02512
3,5/1/2014 0:17:00,40.7463,-74.0011,B02512
4,5/1/2014 0:17:00,40.7594,-73.9734,B02512


In [6]:
print(uber3.shape)
uber3.head()

(99, 4)


Unnamed: 0,Date/Time,Lat,Lon,Base
0,6/1/2014 0:00:00,40.7293,-73.992,B02512
1,6/1/2014 0:01:00,40.7131,-74.0097,B02512
2,6/1/2014 0:04:00,40.3461,-74.661,B02512
3,6/1/2014 0:04:00,40.7555,-73.9833,B02512
4,6/1/2014 0:07:00,40.688,-74.1831,B02512


In [7]:
# Concatenate uber1, uber2, and uber3: row_concat
row_concat = pd.concat([uber1,uber2,uber3])

# Print the shape of row_concat
print(row_concat.shape)

(297, 4)


In [8]:
# Print the head of row_concat
row_concat.head()

Unnamed: 0,Date/Time,Lat,Lon,Base
0,4/1/2014 0:11:00,40.769,-73.9549,B02512
1,4/1/2014 0:17:00,40.7267,-74.0345,B02512
2,4/1/2014 0:21:00,40.7316,-73.9873,B02512
3,4/1/2014 0:28:00,40.7588,-73.9776,B02512
4,4/1/2014 0:33:00,40.7594,-73.9722,B02512
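
One detail the head above does not reveal is that a row-wise `pd.concat()` keeps each input's original row labels, so labels such as 0 and 1 appear once per input. The sketch below uses two tiny stand-in frames (not the real Uber files, although the `Lat` values are borrowed from the outputs above) to show the repeated labels and how `ignore_index=True` renumbers them.

```python
import pandas as pd

# Tiny stand-ins for two monthly DataFrames (illustrative only)
april = pd.DataFrame({'Lat': [40.7690, 40.7267], 'Base': ['B02512', 'B02512']})
may = pd.DataFrame({'Lat': [40.7521, 40.6965], 'Base': ['B02512', 'B02512']})

# Default behaviour: each frame keeps its own row labels, so 0 and 1 repeat
stacked = pd.concat([april, may])
print(stacked.index.tolist())      # [0, 1, 0, 1]

# ignore_index=True discards the old labels and numbers the rows 0..n-1
renumbered = pd.concat([april, may], ignore_index=True)
print(renumbered.index.tolist())   # [0, 1, 2, 3]
```

Whether to renumber depends on whether the original labels carry meaning; for positional labels like these, renumbering usually avoids surprises with `.loc`.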


## Combining columns of data
Think of column-wise concatenation of data as stitching data together from the sides instead of the top and bottom. To perform this action, you use the same `pd.concat()` function, but this time with the keyword argument `axis=1`. The default, `axis=0`, is for a row-wise concatenation.
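
As a rough illustration of the two axes, the sketch below concatenates two made-up frames (`left` and `right` are hypothetical names, not part of the exercise) both ways. It also shows that column-wise concatenation lines rows up by index label, not by position.

```python
import pandas as pd

left = pd.DataFrame({'a': [1, 2, 3]})
right = pd.DataFrame({'b': ['x', 'y', 'z']})

# axis=0 (default): rows are stacked; columns missing from a frame become NaN
by_rows = pd.concat([left, right])
print(by_rows.shape)   # (6, 2)

# axis=1: frames are placed side by side, matched on their index labels
by_cols = pd.concat([left, right], axis=1)
print(by_cols.shape)   # (3, 2)

# If the index labels differ, unmatched rows are padded with NaN
right_shifted = pd.DataFrame({'b': ['x', 'y', 'z']}, index=[1, 2, 3])
misaligned = pd.concat([left, right_shifted], axis=1)
print(misaligned.shape)   # (4, 2)
```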

You'll return to the [Ebola dataset][2] you worked with briefly in the last chapter. You need to build a DataFrame called `ebola_melt`. In this DataFrame, the status and country of each count are contained in a single column, `type_country`. This column is then parsed into a new DataFrame, `status_country`, where there are separate columns for status and country.

Your job is to concatenate them column-wise in order to obtain a final, clean DataFrame.

[2]: https://data.humdata.org/dataset/ebola-cases-2014

In [9]:
ebola = pd.read_csv("../_datasets/ebola.csv")
ebola.head()

Unnamed: 0,Date,Day,Cases_Guinea,Cases_Liberia,Cases_SierraLeone,Cases_Nigeria,Cases_Senegal,Cases_UnitedStates,Cases_Spain,Cases_Mali,Deaths_Guinea,Deaths_Liberia,Deaths_SierraLeone,Deaths_Nigeria,Deaths_Senegal,Deaths_UnitedStates,Deaths_Spain,Deaths_Mali
0,1/5/2015,289,2776.0,,10030.0,,,,,,1786.0,,2977.0,,,,,
1,1/4/2015,288,2775.0,,9780.0,,,,,,1781.0,,2943.0,,,,,
2,1/3/2015,287,2769.0,8166.0,9722.0,,,,,,1767.0,3496.0,2915.0,,,,,
3,1/2/2015,286,,8157.0,,,,,,,,3496.0,,,,,,
4,12/31/2014,284,2730.0,8115.0,9633.0,,,,,,1739.0,3471.0,2827.0,,,,,


In [10]:
# Melt ebola: ebola_melt
ebola_melt = pd.melt(ebola, id_vars=['Date', 'Day'], var_name='type_country', value_name='counts')

ebola_melt.head()

Unnamed: 0,Date,Day,type_country,counts
0,1/5/2015,289,Cases_Guinea,2776.0
1,1/4/2015,288,Cases_Guinea,2775.0
2,1/3/2015,287,Cases_Guinea,2769.0
3,1/2/2015,286,Cases_Guinea,
4,12/31/2014,284,Cases_Guinea,2730.0


In [11]:
str_split = ebola_melt.type_country.str.split('_')
status_country = pd.DataFrame({'status':str_split.str[0],'country':str_split.str[1]})
status_country.head()

Unnamed: 0,status,country
0,Cases,Guinea
1,Cases,Guinea
2,Cases,Guinea
3,Cases,Guinea
4,Cases,Guinea
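
The split-and-rebuild step above can also be written with the `expand=True` option of `.str.split()`, which returns a DataFrame directly. This is a sketch of an alternative, not the approach the exercise uses, and the three sample strings are chosen only for illustration.

```python
import pandas as pd

# A few values in the same shape as the type_country column
type_country = pd.Series(['Cases_Guinea', 'Cases_Liberia', 'Deaths_Guinea'])

# expand=True yields one column per piece instead of a Series of lists
status_country = type_country.str.split('_', expand=True)
status_country.columns = ['status', 'country']

print(status_country)
```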


In [12]:
# Concatenate ebola_melt and status_country column-wise: ebola_tidy
ebola_tidy = pd.concat([ebola_melt,status_country], axis=1)

# Print the shape of ebola_tidy
print(ebola_tidy.shape)

# Print the head of ebola_tidy
ebola_tidy.head()

(1952, 6)


Unnamed: 0,Date,Day,type_country,counts,status,country
0,1/5/2015,289,Cases_Guinea,2776.0,Cases,Guinea
1,1/4/2015,288,Cases_Guinea,2775.0,Cases,Guinea
2,1/3/2015,287,Cases_Guinea,2769.0,Cases,Guinea
3,1/2/2015,286,Cases_Guinea,,Cases,Guinea
4,12/31/2014,284,Cases_Guinea,2730.0,Cases,Guinea


# Finding and concatenating data
## Concatenating many files
- Leverage Python’s features with data cleaning in pandas
- In order to concatenate DataFrames:
    - They must be in a list
    - Can individually load if there are a few datasets
    - But what if there are thousands?
- Solution: glob function to find files based on a pattern

## Globbing
- Pattern matching for file names
- Wildcards: `*` and `?`
    - Any csv file: `*.csv`
    - Any single character: `file_?.csv`
- Returns a list of file names
- Can use this list to load into separate DataFrames

## The plan
- Load files from globbing into pandas
- Add the DataFrames into a list
- Concatenate multiple datasets at once

## Finding files that match a pattern
You're now going to practice using the `glob` module to find all csv files in the workspace. In the next exercise, you'll programmatically load them into DataFrames.

The `glob` module has a function called `glob` that takes a pattern and returns a list of the files in the working directory that match that pattern.

For example, if you know the file names consist of `part_`, a single digit, and `.csv`, you can write the pattern as `'part_?.csv'` (which would match `part_1.csv`, `part_2.csv`, `part_3.csv`, etc.).

Similarly, you can find all `.csv` files with `'*.csv'`, or all parts with `'part_*'`. The `?` wildcard represents exactly one character, and the `*` wildcard represents any number of characters.
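
To see the two wildcards side by side, the sketch below creates a few empty files in a temporary directory and matches them with the patterns just described. The file names and the `match` helper are hypothetical and exist only for this illustration.

```python
import glob
import os
import tempfile

# Create a throwaway directory with a few empty files (hypothetical names)
with tempfile.TemporaryDirectory() as tmp:
    for name in ['part_1.csv', 'part_2.csv', 'part_10.csv', 'notes.txt']:
        open(os.path.join(tmp, name), 'w').close()

    def match(pattern):
        """Return the sorted base names in tmp that match pattern."""
        paths = glob.glob(os.path.join(tmp, pattern))
        return sorted(os.path.basename(p) for p in paths)

    single = match('part_?.csv')   # ? matches exactly one character
    every_csv = match('*.csv')     # * matches any run of characters
    every_part = match('part_*')

print(single)       # ['part_1.csv', 'part_2.csv']
print(every_csv)    # ['part_1.csv', 'part_10.csv', 'part_2.csv']
print(every_part)   # ['part_1.csv', 'part_10.csv', 'part_2.csv']
```

Note that `part_10.csv` is not matched by `'part_?.csv'`, because `?` stands for one character and `10` is two.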

In [13]:
# Import necessary modules
import pandas as pd
import glob

# Write the pattern: pattern
pattern = '*.csv'

# Save all file matches: csv_files
csv_files = glob.glob(pattern)

# Print the file names
print(csv_files)

['nyc_uber_2014_April.csv', 'nyc_uber_2014_June.csv', 'nyc_uber_2014_May.csv']


In [14]:
# Load the second file into a DataFrame: csv2
csv2 = pd.read_csv(csv_files[1], index_col=0)

# Print the head of csv2
csv2.head()

Unnamed: 0,Date/Time,Lat,Lon,Base
0,6/1/2014 0:00:00,40.7293,-73.992,B02512
1,6/1/2014 0:01:00,40.7131,-74.0097,B02512
2,6/1/2014 0:04:00,40.3461,-74.661,B02512
3,6/1/2014 0:04:00,40.7555,-73.9833,B02512
4,6/1/2014 0:07:00,40.688,-74.1831,B02512


## Iterating and concatenating all matches
Now that you have a list of filenames to load, you can load all the files into a list of DataFrames that can then be concatenated.

You'll start with an empty list called `frames`. Your job is to use a for loop to:

1. iterate through each of the filenames
2. read each filename into a DataFrame, and then
3. append it to the `frames` list.

You can then concatenate this list of DataFrames using `pd.concat()`. Go for it!

In [15]:
# Import necessary modules
import pandas as pd
import glob

# Save all file matches: csv_files
csv_files = glob.glob('*.csv')

# Create an empty list: frames
frames = []

# Iterate over csv_files
for csv in csv_files:

    # Read csv into a DataFrame: df
    df = pd.read_csv(csv, index_col=0)

    # Append df to frames
    frames.append(df)

# Concatenate frames into a single DataFrame: uber
uber = pd.concat(frames)

# Print the shape of uber
print(uber.shape)

(297, 4)
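
The loop above can be condensed into a list comprehension. The sketch below is one possible variant, run against two small CSV files written to a temporary directory so that it is self-contained; the file names and rows are made up. It also sorts the matches, since `glob.glob()` returns them in arbitrary order.

```python
import glob
import os
import tempfile

import pandas as pd

with tempfile.TemporaryDirectory() as tmp:
    # Write two small CSV files to stand in for the monthly extracts
    pd.DataFrame({'Lat': [40.769, 40.7267]}).to_csv(os.path.join(tmp, 'month_a.csv'))
    pd.DataFrame({'Lat': [40.7521]}).to_csv(os.path.join(tmp, 'month_b.csv'))

    # sorted() makes the load order deterministic
    csv_files = sorted(glob.glob(os.path.join(tmp, '*.csv')))

    # Read every match into a DataFrame in one expression
    frames = [pd.read_csv(path, index_col=0) for path in csv_files]

# Stack the frames and renumber the rows 0..n-1
combined = pd.concat(frames, ignore_index=True)
print(combined.shape)   # (3, 1)
```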


In [16]:
# Print the head of uber
uber.head()

Unnamed: 0,Date/Time,Lat,Lon,Base
0,4/1/2014 0:11:00,40.769,-73.9549,B02512
1,4/1/2014 0:17:00,40.7267,-74.0345,B02512
2,4/1/2014 0:21:00,40.7316,-73.9873,B02512
3,4/1/2014 0:28:00,40.7588,-73.9776,B02512
4,4/1/2014 0:33:00,40.7594,-73.9722,B02512


# Merge data
## Merging data
- Similar to joining tables in SQL
- Combine disparate datasets based on common columns

## Types of merges
- One-to-one
- Many-to-one / one-to-many
- Many-to-many
- All use the same function
- Only difference is the DataFrames you are merging

## 1-to-1 data merge
Merging data allows you to combine disparate datasets into a single dataset to do more complex analysis.

Here, you'll be using survey data that contains readings that William Dyer, Frank Pabodie, and Valentina Roerich took in the late 1920s and 1930s while they were on an expedition to Antarctica. The dataset was taken from a SQLite database from the [Software Carpentry SQL lesson][3].

Two DataFrames have been pre-loaded: `site` and `visited`. Explore them in the IPython Shell and take note of their structure and column names. Your task is to perform a 1-to-1 merge of these two DataFrames using the `'name'` column of `site` and the `'site'` column of `visited`.

[3]: http://swcarpentry.github.io/sql-novice-survey/

In [17]:
dict_site = {'name': ['DR1', 'DR3', 'MSK-4'], 'lat': [-49.85, -47.15, -48.87], 'long': [-128.57, -126.72, -123.40]}
dict_visited = {'ident': [619, 734, 837], 'site': ['DR1', 'DR3', 'MSK-4'], 'dated': ['1927-02-08', '1939-01-07', '1932-01-14']}

In [18]:
site = pd.DataFrame(dict_site)
site

Unnamed: 0,name,lat,long
0,DR1,-49.85,-128.57
1,DR3,-47.15,-126.72
2,MSK-4,-48.87,-123.4


In [19]:
visited = pd.DataFrame(dict_visited)
visited

Unnamed: 0,ident,site,dated
0,619,DR1,1927-02-08
1,734,DR3,1939-01-07
2,837,MSK-4,1932-01-14


In [20]:
# Merge the DataFrames: o2o
o2o = pd.merge(left=site, right=visited, left_on='name', right_on='site')

# Print one-to-one (o2o) merge
o2o

Unnamed: 0,name,lat,long,ident,site,dated
0,DR1,-49.85,-128.57,619,DR1,1927-02-08
1,DR3,-47.15,-126.72,734,DR3,1939-01-07
2,MSK-4,-48.87,-123.4,837,MSK-4,1932-01-14


## Many-to-1 data merge
In a many-to-one (or one-to-many) merge, the key is unique in one DataFrame but not in the other. Each row on the unique side is therefore repeated in the output once for every matching row on the other side.

Here, the two DataFrames `site` and `visited` have been pre-loaded once again. Note that this time, `visited` has multiple entries for the site column.

The `pd.merge()` call is the same as the 1-to-1 merge from the previous exercise, but the data and output will be different.

In [21]:
dict_visited ={
    'ident': [619,622,734,735,751,752,837,844], 
    'site': ['DR1','DR1','DR3','DR3','DR3','DR3','MSK-4','DR1'], 
    # None marks the missing date so pandas treats it as a missing value, not the text 'NaN'
    'dated': ['1927-02-08','1927-02-10','1939-01-07','1930-01-12','1930-02-26',None,'1932-01-14','1932-03-22']
}
visited = pd.DataFrame(dict_visited)
visited

Unnamed: 0,ident,site,dated
0,619,DR1,1927-02-08
1,622,DR1,1927-02-10
2,734,DR3,1939-01-07
3,735,DR3,1930-01-12
4,751,DR3,1930-02-26
5,752,DR3,
6,837,MSK-4,1932-01-14
7,844,DR1,1932-03-22


In [22]:
# Merge the DataFrames: m2o
m2o = pd.merge(left=site, right=visited, left_on='name', right_on='site')

# Print m2o
m2o

Unnamed: 0,name,lat,long,ident,site,dated
0,DR1,-49.85,-128.57,619,DR1,1927-02-08
1,DR1,-49.85,-128.57,622,DR1,1927-02-10
2,DR1,-49.85,-128.57,844,DR1,1932-03-22
3,DR3,-47.15,-126.72,734,DR3,1939-01-07
4,DR3,-47.15,-126.72,735,DR3,1930-01-12
5,DR3,-47.15,-126.72,751,DR3,1930-02-26
6,DR3,-47.15,-126.72,752,DR3,
7,MSK-4,-48.87,-123.4,837,MSK-4,1932-01-14
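
If you want pandas to confirm which kind of merge you are performing, `pd.merge()` accepts a `validate` argument. The sketch below uses cut-down versions of `site` and `visited` (only a few rows and columns, chosen for illustration) and is one way to check the key relationship, not something the exercise requires.

```python
import pandas as pd
from pandas.errors import MergeError

# Cut-down versions of the site and visited DataFrames
site = pd.DataFrame({'name': ['DR1', 'DR3'], 'lat': [-49.85, -47.15]})
visited = pd.DataFrame({'ident': [619, 622, 734], 'site': ['DR1', 'DR1', 'DR3']})

# one_to_many: keys must be unique in the left frame only, which holds here
checked = pd.merge(site, visited, left_on='name', right_on='site',
                   validate='one_to_many')
print(checked.shape)   # (3, 4)

# one_to_one also requires unique keys on the right, so pandas raises MergeError
try:
    pd.merge(site, visited, left_on='name', right_on='site',
             validate='one_to_one')
    raised = False
except MergeError:
    raised = True
print(raised)   # True
```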


## Many-to-many data merge
The final merging scenario occurs when neither DataFrame has unique keys for the merge. In that case, every pairwise combination of matching rows is created for each duplicated key.

Two example DataFrames that share common key values have been pre-loaded: `df1` and `df2`. Another DataFrame `df3`, which is the result of `df1` merged with `df2`, has been pre-loaded. This example is to help you develop your intuition for many-to-many merges.
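
The contents of `df1`, `df2`, and `df3` are not shown in these notes, so the sketch below uses hypothetical values purely to illustrate the pairwise behaviour. The key `'a'` appears twice in each input and therefore produces 2 × 2 = 4 rows in the result, while `'b'` appears once in each and produces 1 row.

```python
import pandas as pd

# Hypothetical contents; the original df1 and df2 are not reproduced here
df1 = pd.DataFrame({'key': ['a', 'a', 'b'], 'left_val': [1, 2, 3]})
df2 = pd.DataFrame({'key': ['a', 'a', 'b'], 'right_val': ['x', 'y', 'z']})

# Neither frame has unique keys, so matching rows are paired in every combination
df3 = pd.merge(df1, df2, on='key')

print(df3)
print(len(df3))   # 5
```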

Here, you'll work with the `site` and `visited` DataFrames from before, and a new `survey` DataFrame. Your task is to merge `site` and `visited` as you did in the earlier exercises. You will then merge this merged DataFrame with `survey`.

In [23]:
dict_survey ={
    'taken': [619,619,622,622,734,734,734,735,735,735,751,751,751,752,752,752,752,837,837,837,844], 
    # None marks readings with no recorded person, so pandas treats them as missing values
    'person':['dyer','dyer','dyer','dyer','pb','lake','pb','pb',None,None,'pb','pb','lake','lake','lake','lake','roe','lake','lake','roe','roe'], 
    'quant': ['rad','sal','rad','sal','rad','sal','temp','rad','sal','temp','rad','temp','sal','rad','sal','temp','sal','rad','sal','sal','rad'],
    'reading':[9.82,0.13,7.80,0.09,8.41,0.05,-21.50,7.22,0.06,-26.00,4.35,-18.50,0.10,2.19,0.09,-16.00,41.60,1.46,0.21,22.50,11.25]
}
survey = pd.DataFrame(dict_survey)
survey.head() 

Unnamed: 0,taken,person,quant,reading
0,619,dyer,rad,9.82
1,619,dyer,sal,0.13
2,622,dyer,rad,7.8
3,622,dyer,sal,0.09
4,734,pb,rad,8.41


In [24]:
# Merge site and visited: m2o
m2o = pd.merge(left=site, right=visited, left_on='name', right_on='site')

# Merge m2o and survey: m2m
m2m = pd.merge(left=m2o, right=survey, left_on='ident', right_on='taken')

# Print the first 20 lines of m2m
m2m.head(20)

Unnamed: 0,name,lat,long,ident,site,dated,taken,person,quant,reading
0,DR1,-49.85,-128.57,619,DR1,1927-02-08,619,dyer,rad,9.82
1,DR1,-49.85,-128.57,619,DR1,1927-02-08,619,dyer,sal,0.13
2,DR1,-49.85,-128.57,622,DR1,1927-02-10,622,dyer,rad,7.8
3,DR1,-49.85,-128.57,622,DR1,1927-02-10,622,dyer,sal,0.09
4,DR1,-49.85,-128.57,844,DR1,1932-03-22,844,roe,rad,11.25
5,DR3,-47.15,-126.72,734,DR3,1939-01-07,734,pb,rad,8.41
6,DR3,-47.15,-126.72,734,DR3,1939-01-07,734,lake,sal,0.05
7,DR3,-47.15,-126.72,734,DR3,1939-01-07,734,pb,temp,-21.5
8,DR3,-47.15,-126.72,735,DR3,1930-01-12,735,pb,rad,7.22
9,DR3,-47.15,-126.72,735,DR3,1930-01-12,735,,sal,0.06
