### Combining rows of data

    - Three DataFrames have been pre-loaded: uber1 (April 2014), uber2 (May 2014), and uber3 (June 2014).
    - Your job in this exercise is to concatenate these DataFrames so that the resulting DataFrame has the data for all three months.

In [None]:
# Concatenate uber1, uber2, and uber3: row_concat
row_concat = pd.concat([uber1,uber2,uber3])

# Print the shape of row_concat
print(row_concat.shape)

# Print the head of row_concat
print(row_concat.head())

    - You have successfully concatenated the three uber DataFrames!
    - Notice that the head of row_concat is the same as the head of uber1, while the tail of row_concat is the same as the tail of uber3.
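
    - As a minimal sketch of row-wise concatenation, the block below uses two tiny made-up DataFrames (april and may are hypothetical stand-ins, not the real uber data).
    - It also shows that pd.concat() keeps each input's original index labels unless ignore_index=True is passed.

```python
import pandas as pd

# Hypothetical stand-ins for two monthly files (not the real uber data)
april = pd.DataFrame({'pickup': ['4/1/2014', '4/2/2014'], 'base': ['B1', 'B2']})
may = pd.DataFrame({'pickup': ['5/1/2014', '5/2/2014'], 'base': ['B1', 'B3']})

# Stack the rows; each input keeps its own index labels
stacked = pd.concat([april, may])
print(stacked.index.tolist())     # [0, 1, 0, 1]

# ignore_index=True renumbers the rows from 0
renumbered = pd.concat([april, may], ignore_index=True)
print(renumbered.index.tolist())  # [0, 1, 2, 3]
```

    - Repeated index labels are harmless for printing, but label-based selection such as stacked.loc[0] would return one row per input, so renumbering is often convenient.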

### Combining columns of data

    - Think of column-wise concatenation of data as stitching data together from the sides instead of the top and bottom. 
    - To perform this action, you use the same pd.concat() function, but this time with the keyword argument axis=1. 
    - The default, axis=0, is for a row-wise concatenation.
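
    - A small sketch of axis=1, using made-up DataFrames (left, right, and shifted are hypothetical names for illustration).
    - It suggests one thing to watch for: column-wise concatenation lines rows up by index label, not by position, so mismatched indexes produce missing values.

```python
import pandas as pd

# Hypothetical frames that share the default index 0, 1, 2
left = pd.DataFrame({'id': [1, 2, 3]})
right = pd.DataFrame({'status': ['Cases', 'Deaths', 'Cases']})

# Stitch the columns together side by side
side_by_side = pd.concat([left, right], axis=1)
print(side_by_side.shape)  # (3, 2)

# Give the second frame a different index to show label-based alignment
shifted = right.copy()
shifted.index = [1, 2, 3]

# Rows are matched on index labels, so labels 0 and 3 each get one NaN
misaligned = pd.concat([left, shifted], axis=1)
print(misaligned.shape)    # (4, 2)
```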

    - You'll return to the Ebola dataset you worked with briefly in the last chapter.
    - It has been pre-loaded into a DataFrame called ebola_melt, in which the status and country of a patient are contained in a single column.
    - This column has been parsed into a new DataFrame, status_country, which has separate columns for status and country.

    - Explore the ebola_melt and status_country DataFrames in the IPython Shell. 
    - Your job is to concatenate them column-wise in order to obtain a final, clean DataFrame.

In [None]:
# Concatenate ebola_melt and status_country column-wise: ebola_tidy
ebola_tidy = pd.concat([ebola_melt,status_country],axis = 1)

# Print the shape of ebola_tidy
print(ebola_tidy.shape)

# Print the head of ebola_tidy
print(ebola_tidy.head())

    - The concatenated DataFrame has 6 columns, as it should. Notice how the status and country columns have been concatenated column-wise.

### Finding files that match a pattern

    - You're now going to practice using the glob module to find all csv files in the workspace.
    - As Dan showed you in the video, the glob module has a function called glob that takes a pattern and returns a list of the files in the working directory that match that pattern.

    - For example, if you know the file names consist of part_, a single digit, and .csv, you can write the pattern as 'part_?.csv' (which would match part_1.csv, part_2.csv, part_3.csv, etc.).
    - Similarly, you can find all .csv files with '*.csv', or all parts with 'part_*'.
    - The ? wildcard represents any single character, and the * wildcard represents any number of characters (including none).
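
    - The wildcards above can be tried out with a rough sketch like the following, which creates a few empty files in a temporary directory. The file names are made up for illustration.
    - Results are sorted because glob.glob() does not guarantee any particular order.

```python
import glob
import os
import tempfile

def match(folder, pattern):
    """Return the sorted base names in folder that match pattern."""
    paths = glob.glob(os.path.join(folder, pattern))
    return sorted(os.path.basename(p) for p in paths)

with tempfile.TemporaryDirectory() as tmp:
    # Create empty, hypothetical files to match against
    for name in ['part_1.csv', 'part_2.csv', 'part_10.csv', 'notes.txt']:
        open(os.path.join(tmp, name), 'w').close()

    single = match(tmp, 'part_?.csv')  # exactly one character after part_
    all_csv = match(tmp, '*.csv')      # any name ending in .csv
    parts = match(tmp, 'part_*')       # any name starting with part_

print(single)   # ['part_1.csv', 'part_2.csv']
print(all_csv)  # ['part_1.csv', 'part_10.csv', 'part_2.csv']
```

    - Note that 'part_?.csv' skips part_10.csv, since ? stands for exactly one character.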

In [23]:
# Import necessary modules
import glob
import pandas as pd

# Write the pattern: pattern (all csv files in the working directory)
pattern = '*.csv'

# Save all file matches: csv_files
csv_files = glob.glob(pattern)


# Print the file names
print(csv_files)

# Load the first file into a DataFrame: csv2
csv2 = pd.read_csv(csv_files[0])

# Print the head of csv2
print(csv2.head())

['Abalone.csv', 'agaricus-lepiota.csv', 'airquality.csv', 'athlete_events.csv', 'car_sales.csv', 'DATASET_LEAD_GENERATION_PURCHASE_MOD.csv', 'HR Analytics.csv', 'iris.csv', 'movies.csv', 'PURCHASE1.csv', 'PythonExport.csv', 'ratings.csv', 'Telco-Customer-Churn.csv']
  sex  length  diameter  height  weight.w  weight.s  weight.v  weight.sh  \
0   M   0.455     0.365   0.095    0.5140    0.2245    0.1010      0.150   
1   M   0.350     0.265   0.090    0.2255    0.0995    0.0485      0.070   
2   F   0.530     0.420   0.135    0.6770    0.2565    0.1415      0.210   
3   M   0.440     0.365   0.125    0.5160    0.2155    0.1140      0.155   
4   I   0.330     0.255   0.080    0.2050    0.0895    0.0395      0.055   

   rings  
0     15  
1      7  
2      9  
3     10  
4      7  


### Iterating and concatenating all matches

    - Now that you have a list of filenames to load, you can load all the files into a list of DataFrames that can then be concatenated.

    - You'll start with an empty list called frames. Your job is to use a for loop to:

    - iterate through each of the filenames
    - read each filename into a DataFrame, and then
    - append it to the frames list.
    - You can then concatenate this list of DataFrames using pd.concat(). Go for it!

In [26]:
# Create an empty list: frames
frames = []

#  Iterate over csv_files
for csv in csv_files:

    # Read csv into a DataFrame: df
    df = pd.read_csv(csv)

    # Append df to frames
    frames.append(df)

# Concatenate frames into a single DataFrame: uber
# (axis=1 places the files side by side; omit it to stack parts that share columns row-wise)
uber = pd.concat(frames, axis=1)

# Print the shape of uber
print(uber.shape)

# Print the head of uber
print(uber.head())


(271116, 178)
  sex  length  diameter  height  weight.w  weight.s  weight.v  weight.sh  \
0   M   0.455     0.365   0.095    0.5140    0.2245    0.1010      0.150   
1   M   0.350     0.265   0.090    0.2255    0.0995    0.0485      0.070   
2   F   0.530     0.420   0.135    0.6770    0.2565    0.1415      0.210   
3   M   0.440     0.365   0.125    0.5160    0.2155    0.1140      0.155   
4   I   0.330     0.255   0.080    0.2050    0.0895    0.0395      0.055   

   rings class  ...  DeviceProtection TechSupport StreamingTV StreamingMovies  \
0   15.0     p  ...                No          No          No              No   
1    7.0     e  ...               Yes          No          No              No   
2    9.0     e  ...                No          No          No              No   
3   10.0     p  ...               Yes         Yes          No              No   
4    7.0     e  ...                No          No          No              No   

         Contract PaperlessBilling        

    - You can now programmatically combine datasets that are broken up into many smaller parts. 
    - You'll find many datasets in the wild will be stored this way, particularly data that is collected incrementally.
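
    - For parts that share the same columns, a row-wise version of the loop might look like the sketch below. It writes three small made-up part files to a temporary directory first, so the file names and the month and rides columns are assumptions for illustration only.

```python
import glob
import os
import tempfile

import pandas as pd

with tempfile.TemporaryDirectory() as tmp:
    # Write three hypothetical part files that share the same columns
    for i in range(1, 4):
        part = pd.DataFrame({'month': [i, i], 'rides': [10 * i, 20 * i]})
        part.to_csv(os.path.join(tmp, f'part_{i}.csv'), index=False)

    # Find the parts, sorted so they are loaded in a predictable order
    paths = sorted(glob.glob(os.path.join(tmp, 'part_?.csv')))

    # Read every file into its own DataFrame
    frames = [pd.read_csv(path) for path in paths]

# Stack the parts on top of each other and renumber the rows
combined = pd.concat(frames, ignore_index=True)
print(combined.shape)  # (6, 2)
```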

### 1-to-1 data merge

    - Merging data allows you to combine disparate datasets into a single dataset to do more complex analysis.

In [50]:
site = pd.DataFrame({'name': ['DR-1','DR-2','MSK-4'],'lat': [-49.85,-47.15,-48.87],'long':[-128.57,-126.72,-123.40]})
site

    name    lat    long
0   DR-1 -49.85 -128.57
1   DR-2 -47.15 -126.72
2  MSK-4 -48.87 -123.40


In [51]:
visited = pd.DataFrame({'ident' :[619,622,734],'site':['DR-1','MSK-4','DR-2'],
                        'dated':['1927-02-08','1927-02-10','1939-01-07']})
visited

   ident   site       dated
0    619   DR-1  1927-02-08
1    622  MSK-4  1927-02-10
2    734   DR-2  1939-01-07


In [52]:
# Merge the DataFrames: o2o
o2o = pd.merge(left= site, right= visited, left_on='name', right_on='site')

# Print o2o
print(o2o)

    name    lat    long  ident   site       dated
0   DR-1 -49.85 -128.57    619   DR-1  1927-02-08
1   DR-2 -47.15 -126.72    734   DR-2  1939-01-07
2  MSK-4 -48.87 -123.40    622  MSK-4  1927-02-10


    - Notice the 1-to-1 correspondence between the name column of the site DataFrame and the site column of the visited DataFrame.
    - This is what made the 1-to-1 merge possible.
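
    - If you want pandas to check the 1-to-1 assumption for you, one option is the validate argument of pd.merge(). The sketch below rebuilds site and visited from the cells above; dup is a hypothetical copy of visited with one repeated row, added only to trigger the check.

```python
import pandas as pd

site = pd.DataFrame({'name': ['DR-1', 'DR-2', 'MSK-4'],
                     'lat': [-49.85, -47.15, -48.87],
                     'long': [-128.57, -126.72, -123.40]})
visited = pd.DataFrame({'ident': [619, 622, 734],
                        'site': ['DR-1', 'MSK-4', 'DR-2'],
                        'dated': ['1927-02-08', '1927-02-10', '1939-01-07']})

# validate='one_to_one' raises an error if either key column has duplicates
checked = pd.merge(site, visited, left_on='name', right_on='site',
                   validate='one_to_one')
print(checked.shape)  # (3, 6)

# Hypothetical copy of visited with its first row repeated
dup = pd.concat([visited, visited.iloc[[0]]], ignore_index=True)

try:
    pd.merge(site, dup, left_on='name', right_on='site', validate='one_to_one')
    raised = False
except pd.errors.MergeError:
    # The repeated DR-1 key breaks the 1-to-1 assumption
    raised = True

print(raised)  # True
```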

### Many-to-1 data merge

    - In a many-to-one (or one-to-many) merge, one of the values will be duplicated and recycled in the output. 
    - That is, one of the keys in the merge is not unique.

    - Here, the two DataFrames site and visited have been pre-loaded once again.
    - Note that this time, visited has multiple entries for the site column. 
    - Confirm this by exploring it in the IPython Shell.

In [60]:
site = pd.DataFrame({'name': ['DR-1','DR-2','MSK-4'],'lat': [-49.85,-47.15,-48.87],'long':[-128.57,-126.72,-123.40]})
site

    name    lat    long
0   DR-1 -49.85 -128.57
1   DR-2 -47.15 -126.72
2  MSK-4 -48.87 -123.40


In [61]:
visited = pd.DataFrame({'ident' :[619,622,734,735,751,752,837,844],'site':['DR-1','DR-1','DR-3','DR-3','DR-3','DR-3','MSK-4','DR-1'],
                        'dated':['1927-02-08','1927-02-10','1939-01-07','1930-01-12','1930-02-26','NaN','1932-01-14','1932-03-22']})
visited

   ident   site       dated
0    619   DR-1  1927-02-08
1    622   DR-1  1927-02-10
2    734   DR-3  1939-01-07
3    735   DR-3  1930-01-12
4    751   DR-3  1930-02-26
5    752   DR-3         NaN
6    837  MSK-4  1932-01-14
7    844   DR-1  1932-03-22


In [62]:
# Merge the DataFrames: m2o
m2o = pd.merge(left = site,right = visited,left_on='name',right_on='site')

# Print m2o
print(m2o)

    name    lat    long  ident   site       dated
0   DR-1 -49.85 -128.57    619   DR-1  1927-02-08
1   DR-1 -49.85 -128.57    622   DR-1  1927-02-10
2   DR-1 -49.85 -128.57    844   DR-1  1932-03-22
3  MSK-4 -48.87 -123.40    837  MSK-4  1932-01-14


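    - The pd.merge() call is the same as in the 1-to-1 merge from the previous exercise, but the data and output are different.

    - Notice how the site data is duplicated during this many-to-1 merge!
    - Notice also that DR-2 (which has no visits) and DR-3 (which is not in site) are absent from the output, because pd.merge() performs an inner join by default.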
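
    - To see which keys a merge matches and which it leaves out, one possible approach is an outer join with indicator=True, which adds a _merge column. The sketch below uses the same site data and a trimmed copy of visited (the dated column is left out for brevity).

```python
import pandas as pd

site = pd.DataFrame({'name': ['DR-1', 'DR-2', 'MSK-4'],
                     'lat': [-49.85, -47.15, -48.87],
                     'long': [-128.57, -126.72, -123.40]})

# Trimmed copy of visited: only the columns needed for the illustration
visited = pd.DataFrame({'ident': [619, 622, 734, 735, 751, 752, 837, 844],
                        'site': ['DR-1', 'DR-1', 'DR-3', 'DR-3',
                                 'DR-3', 'DR-3', 'MSK-4', 'DR-1']})

# how='outer' keeps unmatched rows from both sides;
# indicator=True records where each row came from in a _merge column
outer = pd.merge(site, visited, left_on='name', right_on='site',
                 how='outer', indicator=True)

counts = outer['_merge'].value_counts().to_dict()
print(counts['both'])        # 4 rows matched (three DR-1 visits, one MSK-4)
print(counts['left_only'])   # 1 row: DR-2 has no visits
print(counts['right_only'])  # 4 rows: DR-3 is not in site
```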

### Many-to-many data merge

    - The final merging scenario occurs when neither DataFrame has unique keys for the merge.
    - In this case, for each duplicated key, every pairwise combination of matching rows is created.

In [63]:
df1 = pd.DataFrame({'c1' : ['a','a','b','b'],'c2':[1,2,3,4]})
print(df1)
df2 = pd.DataFrame({'c1' : ['a','a','b','b'],'c2':[10,20,30,40]})
print(df2)
df3 = pd.merge(left = df1,right = df2,left_on='c1',right_on='c1')
df3


  c1  c2
0  a   1
1  a   2
2  b   3
3  b   4
  c1  c2
0  a  10
1  a  20
2  b  30
3  b  40


  c1  c2_x  c2_y
0  a     1    10
1  a     1    20
2  a     2    10
3  a     2    20
4  b     3    30
5  b     3    40
6  b     4    30
7  b     4    40
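
    - The pairwise expansion can be checked by counting: for each key, the number of output rows should equal the number of matching rows on the left times the number on the right. The sketch below rebuilds df1 and df2 from the cell above; left_counts, right_counts, and expected_rows are names introduced here for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({'c1': ['a', 'a', 'b', 'b'], 'c2': [1, 2, 3, 4]})
df2 = pd.DataFrame({'c1': ['a', 'a', 'b', 'b'], 'c2': [10, 20, 30, 40]})

# How many times each key appears on each side
left_counts = df1['c1'].value_counts()
right_counts = df2['c1'].value_counts()

# Each key contributes (left count) * (right count) rows: 2*2 + 2*2
expected_rows = int((left_counts * right_counts).sum())

merged = pd.merge(df1, df2, on='c1')

print(expected_rows, len(merged))  # 8 8
print(list(merged.columns))        # ['c1', 'c2_x', 'c2_y']
```

    - The _x and _y suffixes appear because both inputs have a non-key column named c2.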


In [None]:
# Merge site and visited: m2m
print(visited)
m2m = pd.merge(left=site, right=visited, left_on='name', right_on='site')

# Merge m2m and survey: m2m
# (survey is assumed to be pre-loaded; its 'taken' column is matched against 'ident')
m2m = pd.merge(left=m2m, right=survey, left_on='ident', right_on='taken')

# Print the first 20 lines of m2m
print(m2m.head(20))