# Concatenate

Data does not always come in a single file
- a 5-million-row dataset may be split across several files
- time series data often arrives as one file per day
- it is important to combine the pieces, so that a single dataset can be cleaned
- alternatively, clean each piece separately and then combine them

Concatenation stacks two or more DataFrames into a single DataFrame.

### The pd.concat() Function
- Note that the original row index labels are kept unchanged



In [9]:
import pandas as pd


In [10]:
weather_p1_dict = {'date': ['2010-01-30', '2010-01-30'], 
                   'element': ['tmax', 'tmin'], 
                   'value': [27.8, 14.5]}
weather_p1 = pd.DataFrame(weather_p1_dict)

weather_p2_dict = {'date': ['2010-02-02', '2010-02-02'], 
                   'element': ['tmax', 'tmin'], 
                   'value': [27.3, 14.4]}
weather_p2 = pd.DataFrame(weather_p2_dict)

In [11]:
print(weather_p1)
print(weather_p2)

         date element  value
0  2010-01-30    tmax   27.8
1  2010-01-30    tmin   14.5
         date element  value
0  2010-02-02    tmax   27.3
1  2010-02-02    tmin   14.4


In [13]:
# Concatenate; notice that the index labels 0, 1 are repeated

concatenated = pd.concat([weather_p1, weather_p2])

print(concatenated)



         date element  value
0  2010-01-30    tmax   27.8
1  2010-01-30    tmin   14.5
0  2010-02-02    tmax   27.3
1  2010-02-02    tmin   14.4


In [15]:
# Notice that .loc with index label 0 returns two rows

concatenated.loc[0, :]

         date element  value
0  2010-01-30    tmax   27.8
0  2010-02-02    tmax   27.3


#### Reset the row index labels

In [17]:
pd.concat([weather_p1, weather_p2], ignore_index=True)

         date element  value
0  2010-01-30    tmax   27.8
1  2010-01-30    tmin   14.5
2  2010-02-02    tmax   27.3
3  2010-02-02    tmin   14.4
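
If the frames have already been concatenated, the index can also be renumbered afterwards. The following is a minimal sketch that rebuilds the two small weather frames from above so it runs on its own; the name `renumbered` is only illustrative.

```python
import pandas as pd

weather_p1 = pd.DataFrame({'date': ['2010-01-30', '2010-01-30'],
                           'element': ['tmax', 'tmin'],
                           'value': [27.8, 14.5]})
weather_p2 = pd.DataFrame({'date': ['2010-02-02', '2010-02-02'],
                           'element': ['tmax', 'tmin'],
                           'value': [27.3, 14.4]})

# Plain concatenation keeps the labels 0, 1, 0, 1
concatenated = pd.concat([weather_p1, weather_p2])

# drop=True discards the old labels instead of keeping them as a new column
renumbered = concatenated.reset_index(drop=True)

print(renumbered.index.tolist())
```

Passing `ignore_index=True` to `pd.concat()` does the renumbering in one step, so `reset_index` is mainly useful when the concatenated frame already exists.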


In [18]:
data = pd.read_csv('./data/nyc_uber_2014.csv')

In [27]:
data.head()

   Unnamed: 0         Date/Time      Lat       Lon    Base
0           0  4/1/2014 0:11:00   40.769  -73.9549  B02512
1           1  4/1/2014 0:17:00  40.7267  -74.0345  B02512
2           2  4/1/2014 0:21:00  40.7316  -73.9873  B02512
3           3  4/1/2014 0:28:00  40.7588  -73.9776  B02512
4           4  4/1/2014 0:33:00  40.7594  -73.9722  B02512


#### Combining columns of data

Think of column-wise concatenation of data as stitching data together from the sides instead of the top and bottom. To perform this action, you use the same pd.concat() function, but this time with the keyword argument axis=1. The default, axis=0, is for a row-wise concatenation.

You'll return to the Ebola dataset you worked with briefly in the last chapter. It has been pre-loaded into a DataFrame called ebola_melt. In this DataFrame, the status and country of a patient are contained in a single column. This column has been parsed into a new DataFrame, status_country, which has separate columns for status and country.

Explore the ebola_melt and status_country DataFrames in the IPython Shell. Your job is to concatenate them column-wise in order to obtain a final, clean DataFrame.

In [None]:
# Concatenate ebola_melt and status_country column-wise: ebola_tidy
ebola_tidy = pd.concat([ebola_melt, status_country], axis=1)

# Print the shape of ebola_tidy
print(ebola_tidy.shape)

# Print the head of ebola_tidy
print(ebola_tidy.head())
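
Because ebola_melt and status_country are not defined in these notes, here is a small self-contained sketch of the same idea. The frames `melted` and `parsed` and their values are made up for illustration and are not the real Ebola data.

```python
import pandas as pd

# Made-up stand-in for ebola_melt: status and country share one column
melted = pd.DataFrame({'status_country': ['Cases_Guinea', 'Cases_Liberia', 'Deaths_Guinea'],
                       'counts': [10, 20, 5]})

# Made-up stand-in for status_country: the combined column split in two
parsed = melted['status_country'].str.split('_', expand=True)
parsed.columns = ['status', 'country']

# axis=1 places the frames side by side, matching rows on their index labels
tidy = pd.concat([melted, parsed], axis=1)

print(tidy.shape)
print(tidy.head())
```

Rows are paired by index label rather than by position, so both frames need the same index for the columns to line up.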


# Concatenating Many files

- Leverage Python's features with data cleaning in Pandas
- In order to concatenate DataFrames:
    - They must be passed to pd.concat() in a list
    - With only a few datasets, each one can be loaded individually
    - With thousands of files, loading them one by one is impractical
    
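### Glob
- Finds files whose names match a pattern
- Globbing is file name pattern matching
- Uses wildcards
    - The asterisk * matches any number of characters (including none)
    - Any csv file: *.csv
    - The question mark ? matches any single character
    - Example: file_?.csv
    - ? matches exactly 1 character of any kind (a digit, a letter, or anything else), so file_?.csv matches file_1.csv and file_a.csv but not file_10.csv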

- Glob returns a list of file names
- Use this list of names to load into separate dataframes
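
The wildcard rules can be tried out without any files on disk. The sketch below uses the standard-library `fnmatch` module, which applies largely the same wildcard rules as `glob`; the file names in `names` are hypothetical.

```python
import fnmatch

# Hypothetical file names, used only to illustrate the wildcards
names = ['file_1.csv', 'file_a.csv', 'file_10.csv', 'file_-.csv', 'notes.txt']

# * matches any run of characters, including none
all_csv = fnmatch.filter(names, '*.csv')

# ? matches exactly one character of any kind, not only digits or letters
one_char = fnmatch.filter(names, 'file_?.csv')

# [0-9] restricts that single character to a digit
one_digit = fnmatch.filter(names, 'file_[0-9].csv')

print(all_csv)
print(one_char)
print(one_digit)
```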


### Glob Plan
- Load files from globbing into Pandas
- Add the DataFrames into a List
- Concatenate multiple datasets at once

In [30]:
import glob

csv_files = glob.glob('./data/*.csv')

print(csv_files)

['./data/airquality.csv', './data/nyc_uber_2014.csv', './data/tips.csv', './data/dob_job_application_filings_subset.csv', './data/ebola.csv', './data/tb.csv', './data/gapminder.csv']


In [33]:
# Start with an empty list
list_data = []

# Read each csv file into a DataFrame and append it to list_data
for filename in csv_files:
    data = pd.read_csv(filename)
    list_data.append(data)

# Finally, concatenate list_data into a single DataFrame.
# These files hold unrelated tables, so the result contains the union of
# all columns, with NaN wherever a file lacks a column.
pd.concat(list_data)

#### Finding files that match a pattern

You're now going to practice using the glob module to find all csv files in the workspace. In the next exercise, you'll programmatically load them into DataFrames.

The glob module has a function called glob that takes a pattern and returns a list of the files in the working directory that match that pattern.

For example, if the file names consist of part_, a single character, and .csv, you can write the pattern as 'part_?.csv' (which would match part_1.csv, part_2.csv, part_3.csv, etc.)

Similarly, you can find all .csv files with '*.csv', or all parts with 'part_*'. The ? wildcard represents exactly 1 character, and the * wildcard represents any number of characters (including none).

In [None]:
# Import necessary modules
import glob
import pandas as pd

# Write the pattern: pattern
pattern = '*.csv'

# Save all file matches: csv_files
csv_files = glob.glob(pattern)

# Print the file names
print(csv_files)


# Load the second file into a DataFrame: csv2
csv2 = pd.read_csv(csv_files[1])

# Print the head of csv2
print(csv2.head())

#### Iterating and concatenating all matches

Now that you have a list of filenames to load, you can load all the files into a list of DataFrames that can then be concatenated.

You'll start with an empty list called frames. Your job is to use a for loop to:

- iterate through each of the filenames,
- read each filename into a DataFrame, and then
- append it to the frames list.

You can then concatenate this list of DataFrames using pd.concat(). Go for it!

In [None]:
# Create an empty list: frames
frames = []

#  Iterate over csv_files
for csv in csv_files:

    #  Read csv into a DataFrame: df
    df = pd.read_csv(csv)
    
    # Append df to frames
    frames.append(df)

# Concatenate frames into a single DataFrame: uber
uber = pd.concat(frames)

# Print the shape of uber
print(uber.shape)

# Print the head of uber
print(uber.head())


Fantastic work! You can now programmatically combine datasets that are broken up into many smaller parts. You'll find many datasets in the wild will be stored this way, particularly data that is collected incrementally.
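
The whole glob-and-concatenate workflow can be sketched end to end without depending on the files in ./data. The example below writes three small made-up part files to a temporary directory first; the file names and values are assumptions chosen only for illustration.

```python
import glob
import os
import tempfile

import pandas as pd

with tempfile.TemporaryDirectory() as tmp_dir:
    # Write three small, made-up part files so the example is self-contained
    for i in range(3):
        part = pd.DataFrame({'part': [i, i], 'value': [i * 10, i * 10 + 1]})
        part.to_csv(os.path.join(tmp_dir, f'part_{i}.csv'), index=False)

    # glob does not guarantee an ordering, so sort for a reproducible result
    csv_files = sorted(glob.glob(os.path.join(tmp_dir, 'part_?.csv')))

    # A list comprehension replaces the explicit loop-and-append
    frames = [pd.read_csv(path) for path in csv_files]

# ignore_index=True renumbers the rows 0..n-1 across all parts
combined = pd.concat(frames, ignore_index=True)

print(combined.shape)
print(combined.head())
```

Sorting the file list matters when the row order of the combined frame should follow the file names, for example with one file per day.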



# Merging like SQL
- Concatenation is not the only way data can be combined
- Concatenating side by side pairs rows by index label, so it gives wrong results when the two DataFrames list the states in a different order
- Merging Data
- Similar to Join in SQL
- Combine disparate datasets based on a common set of columns

In [36]:
data_dict = {'state': ['California', 'Texas', 'Florida', 'New York'],
             'population_2016': ['29250017', '27862596', '20612439', '19745289']}

data_population = pd.DataFrame(data_dict)

In [37]:
data_population

  population_2016       state
0        29250017  California
1        27862596       Texas
2        20612439     Florida
3        19745289    New York


In [40]:
data_ansi_dict = {'name': ['California', 'Texas', 'Florida', 'New York'],
                  'ANSI': ['CA', 'TX', 'FL', 'NY']}

state_codes = pd.DataFrame(data_ansi_dict)

In [41]:
state_codes

  ANSI        name
0   CA  California
1   TX       Texas
2   FL     Florida
3   NY    New York


###  Merge

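The merge keys are the state column of data_population and the name column of state_codes.
Each state appears once in each DataFrame, so this is an example of a one-to-one merge.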

In [63]:
survey = pd.read_csv('./data/survey.txt')
site = pd.read_csv('./data/site.txt')
visited = pd.read_csv('./data/visited.txt')

In [64]:
visited

   ident   site       dated
0    619   DR-1  1927-02-08
1    622   DR-1  1927-02-10
2    734   DR-3  1939-01-07
3    735   DR-3  1930-01-12
4    751   DR-3  1930-02-26
5    752   DR-3         NaN
6    837  MSK-4  1932-01-14
7    844   DR-1  1932-03-22

In [65]:
# Use the on parameter when the key column has the same name in both DataFrames.
# Here the names differ ('state' vs. 'name'), so specify left_on and right_on
# instead and leave on at its default of None.


pd.merge(left=data_population, right=state_codes,
         left_on='state', right_on='name')

  population_2016       state ANSI        name
0        29250017  California   CA  California
1        27862596       Texas   TX       Texas
2        20612439     Florida   FL     Florida
3        19745289    New York   NY    New York
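
By default pd.merge() performs an inner join, so a row whose key has no exact match in the other DataFrame is dropped without any warning. The sketch below shows one way to spot such rows, using the `how` and `indicator` parameters. The frames are hypothetical: the key 'Cali' is deliberately misspelled and `pop_rank` is an illustrative column.

```python
import pandas as pd

# Hypothetical data: 'Cali' will not match 'California'
population = pd.DataFrame({'state': ['Cali', 'Texas'],
                           'pop_rank': [1, 2]})
codes = pd.DataFrame({'name': ['California', 'Texas'],
                      'ANSI': ['CA', 'TX']})

# Default inner join: only keys present in both frames survive
inner = pd.merge(population, codes, left_on='state', right_on='name')

# A left join keeps every row of population; indicator=True adds a _merge
# column saying whether each row was found in both frames or only the left one
checked = pd.merge(population, codes, how='left',
                   left_on='state', right_on='name', indicator=True)
unmatched = checked.loc[checked['_merge'] == 'left_only', 'state'].tolist()

print(len(inner))
print(unmatched)
```

Comparing the row count before and after a merge is a quick first check; the indicator column then shows which keys failed to match.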


#### Types of Merges

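- One-to-one
    - Each key value appears once in each DataFrame
- Many-to-one / one-to-many
    - Duplicate values will be repeated
    - Example: multiple cities belong to each state
    - The value of the state code will be repeated for every city in that state
- Many-to-many
    - Key values are duplicated in both DataFrames, so every pairwise combination of matching rows is created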
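
pd.merge() can check which of these cases the data actually falls into through its `validate` parameter. The following is a small sketch with made-up `cities` and `states` frames; it is one possible way to guard a merge, not something the exercises below require.

```python
import pandas as pd

# Made-up frames: several cities per state, one row per state
states = pd.DataFrame({'name': ['California', 'Texas'],
                       'ANSI': ['CA', 'TX']})
cities = pd.DataFrame({'city': ['Los Angeles', 'San Diego', 'Houston'],
                       'state': ['California', 'California', 'Texas']})

# Many cities map to one state row, so the state code is repeated
merged = pd.merge(cities, states, left_on='state', right_on='name',
                  validate='many_to_one')
print(merged)

# Asking for a one-to-one merge fails because 'California' repeats on the left
one_to_one_ok = True
try:
    pd.merge(cities, states, left_on='state', right_on='name',
             validate='one_to_one')
except pd.errors.MergeError as err:
    one_to_one_ok = False
    print('not one-to-one:', err)
```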
    


#### 1-to-1 data merge

Merging data allows you to combine disparate datasets into a single dataset to do more complex analysis.

Here, you'll be using survey data that contains readings that William Dyer, Frank Pabodie, and Valentina Roerich took in the late 1920s and 1930s while they were on an expedition towards Antarctica. The dataset was taken from a SQLite database from the Software Carpentry SQL lesson.

Two DataFrames have been pre-loaded: site and visited. Explore them in the IPython Shell and take note of their structure and column names. Your task is to perform a 1-to-1 merge of these two DataFrames using the 'name' column of site and the 'site' column of visited.

In [66]:
site.columns

Index(['name', ' lat', ' long'], dtype='object')

In [67]:
visited.columns

Index(['ident', 'site', 'dated'], dtype='object')

In [69]:
# Merge the DataFrames: o2o
o2o = pd.merge(left=site, right=visited, left_on='name', right_on='site')

# Print o2o
print(o2o)


Empty DataFrame
Columns: [name,  lat,  long, ident, site, dated]
Index: []
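
An empty result means that no value in site['name'] matched any value in visited['site']. The cause cannot be determined from the output alone. The leading spaces in the column names ' lat' and ' long' suggest that at least one text file has a space after each comma, so stray whitespace in the key values is one plausible explanation. The sketch below illustrates that assumption with hypothetical file contents (the coordinates are placeholders) and shows how `skipinitialspace=True` in `pd.read_csv` removes such whitespace.

```python
import io

import pandas as pd

# Hypothetical file contents with a space after every comma
site_txt = "name, lat, long\nDR-1, -49.85, -128.57\nDR-3, -47.15, -126.72\n"
visited_txt = ("ident, site, dated\n"
               "619, DR-1, 1927-02-08\n"
               "622, DR-1, 1927-02-10\n"
               "735, DR-3, 1930-01-12\n")

# Read as-is: the spaces become part of the column names and string values
raw_site = pd.read_csv(io.StringIO(site_txt))
raw_visited = pd.read_csv(io.StringIO(visited_txt))
print(raw_visited[' site'].tolist())  # the list repr exposes the leading spaces

# 'DR-1' never equals ' DR-1', so nothing matches
raw_merge = pd.merge(raw_site, raw_visited, left_on='name', right_on=' site')
print(len(raw_merge))

# skipinitialspace=True drops whitespace that follows a delimiter
clean_site = pd.read_csv(io.StringIO(site_txt), skipinitialspace=True)
clean_visited = pd.read_csv(io.StringIO(visited_txt), skipinitialspace=True)

clean_merge = pd.merge(clean_site, clean_visited, left_on='name', right_on='site')
print(clean_merge)
```

Printing the key values as a list, as done for `raw_visited[' site']`, is a quick way to check whether the real files have this problem before changing how they are read.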


#### Many-to-1 data merge

In a many-to-one (or one-to-many) merge, one of the values will be duplicated and recycled in the output. That is, one of the keys in the merge is not unique.

Here, the two DataFrames site and visited have been pre-loaded once again. Note that this time, visited has multiple entries for the site column. Confirm this by exploring it in the IPython Shell.

The pd.merge() call is the same as the 1-to-1 merge from the previous exercise, but the data and output will be different.

In [None]:
# Merge the DataFrames: m2o
m2o = pd.merge(left=site, right=visited, left_on='name', right_on='site')

# Print m2o
print(m2o)


#### Many-to-many data merge

The final merging scenario occurs when neither DataFrame has unique keys for the merge. In that case, every pairwise combination of the matching rows is created for each duplicated key.

Two example DataFrames that share common key values have been pre-loaded: df1 and df2. Another DataFrame df3, which is the result of df1 merged with df2, has been pre-loaded. All three DataFrames have been printed - look at the output and notice how pairwise combinations have been created. This example is to help you develop your intuition for many-to-many merges.
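
Since df1, df2, and df3 are not printed in these notes, the sketch below builds small made-up versions of them to show the pairwise combinations. The column names and values are assumptions chosen for illustration.

```python
import pandas as pd

# Made-up frames in which the key 'a' appears twice on both sides
df1 = pd.DataFrame({'key': ['a', 'a', 'b'], 'left_val': [1, 2, 3]})
df2 = pd.DataFrame({'key': ['a', 'a', 'b'], 'right_val': [10, 20, 30]})

# Each 'a' row on the left pairs with each 'a' row on the right: 2 x 2 = 4 rows,
# plus the single 'b' match, for 5 rows in total
df3 = pd.merge(df1, df2, on='key')

print(df3)
```

The row count of a many-to-many merge grows with the product of the duplicate counts per key, which is worth keeping in mind for large datasets.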

Here, you'll work with the site and visited DataFrames from before, and a new survey DataFrame. Your task is to merge site and visited as you did in the earlier exercises. You will then merge this merged DataFrame with survey.

Begin by exploring the site, visited, and survey DataFrames in the IPython Shell.

In [None]:
# Merge site and visited: m2m
m2m = pd.merge(left=site, right=visited, left_on='name', right_on='site')

# Merge m2m and survey: m2m
m2m = pd.merge(left=m2m, right=survey, left_on='ident', right_on='taken')

# Print the first 20 lines of m2m
print(m2m.head(20))
