### Combining rows of data

This example uses three DataFrames: uber1, which contains data for April 2014; uber2, which
contains data for May 2014; and uber3, which contains data for June 2014. We will concatenate these DataFrames so that the resulting DataFrame holds the data for all
three months.

In [12]:
import pandas as pd

# Assign url of file: url
url= 'https://raw.githubusercontent.com/ksatola/Data-Science-Notes/master/data/uber1.csv'

# Read file into a DataFrame: uber1
uber1 = pd.read_csv(url, index_col=0)

# Save a local copy (the Data/ directory must already exist)
uber1.to_csv("Data/uber1.csv")

# Print the head of the DataFrame
uber1.head()

Unnamed: 0,Date/Time,Lat,Lon,Base
0,4/1/2014 0:11:00,40.769,-73.9549,B02512
1,4/1/2014 0:17:00,40.7267,-74.0345,B02512
2,4/1/2014 0:21:00,40.7316,-73.9873,B02512
3,4/1/2014 0:28:00,40.7588,-73.9776,B02512
4,4/1/2014 0:33:00,40.7594,-73.9722,B02512


In [14]:
# Assign url of file: url
url= 'https://raw.githubusercontent.com/ksatola/Data-Science-Notes/master/data/uber2.csv'

# Read file into a DataFrame: uber2
uber2 = pd.read_csv(url, index_col=0)

uber2.to_csv("Data/uber2.csv")

# Print the head of the DataFrame
uber2.head()

Unnamed: 0,Date/Time,Lat,Lon,Base
0,5/1/2014 0:02:00,40.7521,-73.9914,B02512
1,5/1/2014 0:06:00,40.6965,-73.9715,B02512
2,5/1/2014 0:15:00,40.7464,-73.9838,B02512
3,5/1/2014 0:17:00,40.7463,-74.0011,B02512
4,5/1/2014 0:17:00,40.7594,-73.9734,B02512


In [13]:
# Assign url of file: url
url= 'https://raw.githubusercontent.com/ksatola/Data-Science-Notes/master/data/uber3.csv'

# Read file into a DataFrame: uber3
uber3 = pd.read_csv(url, index_col=0)

uber3.to_csv("Data/uber3.csv")

# Print the head of the DataFrame
uber3.head()

Unnamed: 0,Date/Time,Lat,Lon,Base
0,6/1/2014 0:00:00,40.7293,-73.992,B02512
1,6/1/2014 0:01:00,40.7131,-74.0097,B02512
2,6/1/2014 0:04:00,40.3461,-74.661,B02512
3,6/1/2014 0:04:00,40.7555,-73.9833,B02512
4,6/1/2014 0:07:00,40.688,-74.1831,B02512


In [6]:
# Concatenate uber1, uber2, and uber3: row_concat
row_concat = pd.concat([uber1,uber2,uber3])

# Print the shape of row_concat
print(row_concat.shape)

# Print the head of row_concat
print(row_concat.head())

(297, 4)
          Date/Time      Lat      Lon    Base
0  4/1/2014 0:11:00  40.7690 -73.9549  B02512
1  4/1/2014 0:17:00  40.7267 -74.0345  B02512
2  4/1/2014 0:21:00  40.7316 -73.9873  B02512
3  4/1/2014 0:28:00  40.7588 -73.9776  B02512
4  4/1/2014 0:33:00  40.7594 -73.9722  B02512
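
By default, `pd.concat()` keeps each input's original index, so row labels can repeat after a row-wise concatenation. The sketch below uses two small made-up frames (the names `april` and `may` and their values are illustrative, not the Uber data) to show how `ignore_index=True` produces a fresh 0..n-1 index instead.

```python
import pandas as pd

# Two tiny illustrative frames (hypothetical values, not the Uber data)
april = pd.DataFrame({"Lat": [40.769, 40.7267], "Base": ["B02512", "B02512"]})
may = pd.DataFrame({"Lat": [40.7521], "Base": ["B02512"]})

# Default behaviour: the original row labels are kept, so 0 appears twice
stacked = pd.concat([april, may])
print(stacked.index.tolist())      # [0, 1, 0]

# ignore_index=True discards the old labels and numbers the rows 0..n-1
renumbered = pd.concat([april, may], ignore_index=True)
print(renumbered.index.tolist())   # [0, 1, 2]
```

Whether to renumber depends on whether the original row labels carry meaning; if they are only positional, a fresh index is usually easier to work with.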


### Combining columns of data
Think of column-wise concatenation of data as stitching data together from the sides instead of
the top and bottom. To perform this action, you use the same `pd.concat()` function, but this time
with the keyword argument `axis=1`. The default, `axis=0`, performs a row-wise concatenation.

In [10]:
uber_x1 = row_concat[['Date/Time', 'Lat']]
uber_x2 = row_concat[['Lon', 'Base']]
print(uber_x1.head())
print(uber_x2.head())

          Date/Time      Lat
0  4/1/2014 0:11:00  40.7690
1  4/1/2014 0:17:00  40.7267
2  4/1/2014 0:21:00  40.7316
3  4/1/2014 0:28:00  40.7588
4  4/1/2014 0:33:00  40.7594
       Lon    Base
0 -73.9549  B02512
1 -74.0345  B02512
2 -73.9873  B02512
3 -73.9776  B02512
4 -73.9722  B02512


In [11]:
# Concatenate column-wise: uber
uber = pd.concat([uber_x1, uber_x2], axis=1)

# Print the shape of uber
print(uber.shape)

# Print the head of uber
print(uber.head())

(297, 4)
          Date/Time      Lat      Lon    Base
0  4/1/2014 0:11:00  40.7690 -73.9549  B02512
1  4/1/2014 0:17:00  40.7267 -74.0345  B02512
2  4/1/2014 0:21:00  40.7316 -73.9873  B02512
3  4/1/2014 0:28:00  40.7588 -73.9776  B02512
4  4/1/2014 0:33:00  40.7594 -73.9722  B02512
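
Column-wise concatenation matches rows by index label rather than by position. In the example above both halves share the same index, so nothing is lost. The following sketch, with hypothetical frames `left` and `right` whose indexes only partly overlap, shows what happens otherwise: labels missing from one side are filled with NaN.

```python
import pandas as pd

# Illustrative frames with partly different row labels (hypothetical values)
left = pd.DataFrame({"Lat": [40.769, 40.7267]}, index=[0, 1])
right = pd.DataFrame({"Lon": [-73.9549, -73.9873]}, index=[0, 2])

# axis=1 places the columns side by side, matching rows by index label
side_by_side = pd.concat([left, right], axis=1)

# Label 1 has no Lon and label 2 has no Lat, so those cells are NaN
print(side_by_side)
```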


### Finding files that match a pattern
You're now going to practice using the glob module to find all CSV files in the workspace.
In the next exercise, you'll programmatically load them into DataFrames.
The glob module has a function called `glob()` that takes a
pattern and returns a list of the paths that match it; a relative pattern is resolved against the working directory.
For example, if the file names are part_, then a single digit, then .csv, you can write the
pattern as 'part_?.csv' (which would match part_1.csv, part_2.csv, part_3.csv, etc.).
Similarly, you can find all .csv files with '*.csv', or all parts with 'part_*'. The ?
wildcard represents any single character, and the * wildcard represents any number of characters, including none.
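
The wildcard rules can be checked without touching the real data. The sketch below creates a few empty files in a temporary directory (the file names are illustrative assumptions) and compares what each pattern matches. The results are sorted because `glob.glob()` does not guarantee any particular order.

```python
import glob
import os
import tempfile

# Create a few empty files in a temporary directory (file names are illustrative)
with tempfile.TemporaryDirectory() as tmp:
    for name in ["part_1.csv", "part_2.csv", "part_10.csv", "notes.txt"]:
        open(os.path.join(tmp, name), "w").close()

    def match(pattern):
        """Return the sorted base names of the files in tmp that match pattern."""
        return sorted(os.path.basename(p) for p in glob.glob(os.path.join(tmp, pattern)))

    # '?' matches exactly one character, so part_10.csv is excluded
    single = match("part_?.csv")

    # '*' matches any number of characters, so every part file is included
    every_part = match("part_*")

    # '*.csv' matches every file name ending in .csv
    all_csv = match("*.csv")

print(single)      # ['part_1.csv', 'part_2.csv']
print(every_part)  # ['part_1.csv', 'part_10.csv', 'part_2.csv']
print(all_csv)     # ['part_1.csv', 'part_10.csv', 'part_2.csv']
```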

In [18]:
# Import necessary modules
import glob

# Write the pattern: pattern
pattern = 'Data/uber*.csv'

# Save all file matches: csv_files
csv_files = glob.glob(pattern)

# Print the file names
print(csv_files)

# Load the second file into a DataFrame: csv2
csv2 = pd.read_csv(csv_files[1])

# Print the head of csv2
print(csv2.head())

['Data\\uber1.csv', 'Data\\uber2.csv', 'Data\\uber3.csv']
   Unnamed: 0         Date/Time      Lat      Lon    Base
0           0  5/1/2014 0:02:00  40.7521 -73.9914  B02512
1           1  5/1/2014 0:06:00  40.6965 -73.9715  B02512
2           2  5/1/2014 0:15:00  40.7464 -73.9838  B02512
3           3  5/1/2014 0:17:00  40.7463 -74.0011  B02512
4           4  5/1/2014 0:17:00  40.7594 -73.9734  B02512


### Iterating and concatenating all matches
Now that you have a list of filenames to load, you can load all the files into a list of DataFrames that can then be concatenated.
You'll start with an empty list called frames. Your job is to use a for loop to iterate through each of the filenames, read each filename into a DataFrame, and then append it to the frames list.
You can then concatenate this list of DataFrames using `pd.concat()`.

In [23]:
# Create an empty list: frames
frames = []

# Iterate over csv_files
for csv in csv_files:

    # Read csv into a DataFrame: df
    df = pd.read_csv(csv, index_col=0)
    
    # Append df to frames
    frames.append(df)
    

# Concatenate frames into a single DataFrame: uber
uber = pd.concat(frames)

# Print the shape of uber
print(uber.shape)

# Print the head of uber
print(uber.head())

(297, 4)
          Date/Time      Lat      Lon    Base
0  4/1/2014 0:11:00  40.7690 -73.9549  B02512
1  4/1/2014 0:17:00  40.7267 -74.0345  B02512
2  4/1/2014 0:21:00  40.7316 -73.9873  B02512
3  4/1/2014 0:28:00  40.7588 -73.9776  B02512
4  4/1/2014 0:33:00  40.7594 -73.9722  B02512
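
The same loop can be written more compactly with a list comprehension. The sketch below is one possible variant, not the only way to do it: it writes two small made-up CSV files (`demo1.csv` and `demo2.csv` are hypothetical names) to a temporary directory, sorts the matched paths so that the file order is deterministic, and renumbers the rows of the combined result.

```python
import glob
import os
import tempfile

import pandas as pd

with tempfile.TemporaryDirectory() as tmp:
    # Write two small illustrative CSV files (hypothetical names and values)
    pd.DataFrame({"Lat": [40.769, 40.7267]}).to_csv(os.path.join(tmp, "demo1.csv"))
    pd.DataFrame({"Lat": [40.7521]}).to_csv(os.path.join(tmp, "demo2.csv"))

    # glob.glob() returns matches in arbitrary order, so sort for a stable result
    paths = sorted(glob.glob(os.path.join(tmp, "demo*.csv")))

    # Read every file in one pass; index_col=0 restores the saved row labels
    frames = [pd.read_csv(path, index_col=0) for path in paths]

# Stack the frames and renumber the rows from 0
combined = pd.concat(frames, ignore_index=True)
print(combined.shape)  # (3, 1)
```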


### Merging data

- 1-to-1 data merge

Merging data allows you to combine disparate datasets into a single dataset to do more complex analysis.

Your task is to perform a 1-to-1 merge of two DataFrames using the 'name' column of site and the 'site' column of visited.

In [39]:
site = pd.read_csv('Data/site.csv')
site = site.drop(['Unnamed: 0'], axis=1)
print(site.shape)
site.head()

(3, 3)


Unnamed: 0,name,lat,long
0,DR-1,-49.85,-128.57
1,DR-3,-47.15,-126.72
2,MSK-4,-48.87,-123.4


In [43]:
visited = pd.read_csv('Data/visited.csv')
visited = visited.drop(['Unnamed: 0'], axis=1)
print(visited.shape)
visited.head()

(3, 3)


Unnamed: 0,ident,site,dated
0,734,DR-3,1939-01-07
1,837,MSK-4,1932-01-14
2,619,DR-1,1927-02-08


In [35]:
# Merge the DataFrames: o2o
o2o = pd.merge(left=site, right=visited, left_on='name', right_on='site')

# Print o2o
print(o2o)

    name    lat    long  ident   site       dated
0   DR-1 -49.85 -128.57    619   DR-1  1927-02-08
1   DR-3 -47.15 -126.72    734   DR-3  1939-01-07
2  MSK-4 -48.87 -123.40    837  MSK-4  1932-01-14
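
If you want pandas to confirm that a merge really is 1-to-1, `pd.merge()` accepts a `validate` argument. The sketch below rebuilds the small site and visited tables shown above as in-memory DataFrames (so it runs without the CSV files) and asks for a `"one_to_one"` check, which raises `pandas.errors.MergeError` if either key column contains duplicates.

```python
import pandas as pd

# Rebuild the small site and visited tables shown above
site = pd.DataFrame({
    "name": ["DR-1", "DR-3", "MSK-4"],
    "lat": [-49.85, -47.15, -48.87],
    "long": [-128.57, -126.72, -123.40],
})
visited = pd.DataFrame({
    "ident": [734, 837, 619],
    "site": ["DR-3", "MSK-4", "DR-1"],
    "dated": ["1939-01-07", "1932-01-14", "1927-02-08"],
})

# validate="one_to_one" checks that the merge keys are unique on both sides
o2o = pd.merge(left=site, right=visited, left_on="name", right_on="site",
               validate="one_to_one")
print(o2o.shape)  # (3, 6)
```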


- Many-to-1 data merge

In a many-to-one (or one-to-many) merge, one of the keys in the merge is not unique, so the matching row from the other DataFrame is duplicated for each occurrence in the output.

Note that this time, visited_2 has multiple entries for the site column.
The `pd.merge()` call is the same as in the 1-to-1 merge above, but the data and output are different.

In [48]:
visited_2 = pd.read_csv('Data/visited_2.csv')
visited_2 = visited_2.drop(['Unnamed: 0'], axis=1)
print(visited_2.shape)
visited_2

(8, 3)


Unnamed: 0,ident,site,dated
0,619,DR-1,1927-02-08
1,622,DR-1,1927-02-10
2,734,DR-3,1939-01-07
3,735,DR-3,1930-01-12
4,751,DR-3,1930-02-26
5,752,DR-3,
6,837,MSK-4,1932-01-14
7,844,DR-1,1932-03-22


In [49]:
# Merge the DataFrames: m2o
m2o = pd.merge(left=site, right=visited_2, left_on='name', right_on='site')

# Print m2o
print(m2o)

    name    lat    long  ident   site       dated
0   DR-1 -49.85 -128.57    619   DR-1  1927-02-08
1   DR-1 -49.85 -128.57    622   DR-1  1927-02-10
2   DR-1 -49.85 -128.57    844   DR-1  1932-03-22
3   DR-3 -47.15 -126.72    734   DR-3  1939-01-07
4   DR-3 -47.15 -126.72    735   DR-3  1930-01-12
5   DR-3 -47.15 -126.72    751   DR-3  1930-02-26
6   DR-3 -47.15 -126.72    752   DR-3         NaN
7  MSK-4 -48.87 -123.40    837  MSK-4  1932-01-14
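
The duplication can be verified with the `validate` argument of `pd.merge()`. The sketch below rebuilds site and a reduced visited_2 (the `dated` column is left out for brevity) from the tables above. A `"one_to_many"` check passes because the key is unique in site, while a `"one_to_one"` check on the same data raises `pandas.errors.MergeError`.

```python
import pandas as pd

# Rebuild site and a reduced visited_2 (without 'dated') from the tables above
site = pd.DataFrame({
    "name": ["DR-1", "DR-3", "MSK-4"],
    "lat": [-49.85, -47.15, -48.87],
    "long": [-128.57, -126.72, -123.40],
})
visited_2 = pd.DataFrame({
    "ident": [619, 622, 734, 735, 751, 752, 837, 844],
    "site": ["DR-1", "DR-1", "DR-3", "DR-3", "DR-3", "DR-3", "MSK-4", "DR-1"],
})

# The key is unique in site but repeated in visited_2, so this check passes
m2o = pd.merge(left=site, right=visited_2, left_on="name", right_on="site",
               validate="one_to_many")

# Each site row is repeated once per matching visit
visits_per_site = m2o["name"].value_counts().to_dict()
print(visits_per_site)

# Requesting a 1-to-1 merge on the same data fails because 'site' is not unique
one_to_one_failed = False
try:
    pd.merge(left=site, right=visited_2, left_on="name", right_on="site",
             validate="one_to_one")
except pd.errors.MergeError as err:
    one_to_one_failed = True
    print("Not a 1-to-1 merge:", err)
```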


- Many-to-many data merge

The final merging scenario occurs when neither DataFrame has unique keys for the merge. In that case, every pairwise combination of rows is created for each duplicated key.
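
The pairwise behaviour is easiest to see on a toy example. In the sketch below, the hypothetical frames `left` and `right` both repeat the key `'a'` (twice and three times respectively), so the merge produces 2 × 3 = 6 rows for that key, plus one row for the key `'b'`, which appears once on each side.

```python
import pandas as pd

# Hypothetical frames in which the key 'a' is duplicated on both sides
left = pd.DataFrame({"key": ["a", "a", "b"], "left_val": [1, 2, 3]})
right = pd.DataFrame({"key": ["a", "a", "a", "b"], "right_val": [10, 20, 30, 40]})

# Every left row with key 'a' pairs with every right row with key 'a'
pairs = pd.merge(left, right, on="key")

rows_for_a = int((pairs["key"] == "a").sum())
print(rows_for_a)   # 6 (2 left rows * 3 right rows)
print(len(pairs))   # 7 (6 rows for 'a' plus 1 row for 'b')
```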

Here, you'll work with the site and visited DataFrames from before, and a new survey DataFrame, survey_site. Your task is to merge site and visited as you did in the earlier exercises, and then merge the result with survey_site.


In [55]:
# Read file into a DataFrame: survey_site
survey_site = pd.read_csv("Data/survey_site.csv", index_col=0)

# Print the head of the DataFrame
survey_site

Unnamed: 0,taken,person,quant,reading
0,619,dyer,rad,9.82
1,619,dyer,sal,0.13
2,622,dyer,rad,7.8
3,622,dyer,sal,0.09
4,734,pb,rad,8.41
5,734,lake,sal,0.05
6,734,pb,temp,-21.5
7,735,pb,rad,7.22
8,735,,sal,0.06
9,735,,temp,-26.0


In [53]:
# Merge site and visited: m2m
m2m = pd.merge(left=site, right=visited, left_on='name', right_on='site')

m2m

Unnamed: 0,name,lat,long,ident,site,dated
0,DR-1,-49.85,-128.57,619,DR-1,1927-02-08
1,DR-3,-47.15,-126.72,734,DR-3,1939-01-07
2,MSK-4,-48.87,-123.4,837,MSK-4,1932-01-14


In [54]:
# Merge m2m and survey_site: m2m
m2m = pd.merge(left=m2m, right=survey_site, left_on='ident', right_on='taken')

m2m

Unnamed: 0,name,lat,long,ident,site,dated,taken,person,quant,reading
0,DR-1,-49.85,-128.57,619,DR-1,1927-02-08,619,dyer,rad,9.82
1,DR-1,-49.85,-128.57,619,DR-1,1927-02-08,619,dyer,sal,0.13
2,DR-3,-47.15,-126.72,734,DR-3,1939-01-07,734,pb,rad,8.41
3,DR-3,-47.15,-126.72,734,DR-3,1939-01-07,734,lake,sal,0.05
4,DR-3,-47.15,-126.72,734,DR-3,1939-01-07,734,pb,temp,-21.5
5,MSK-4,-48.87,-123.4,837,MSK-4,1932-01-14,837,lake,rad,1.46
6,MSK-4,-48.87,-123.4,837,MSK-4,1932-01-14,837,lake,sal,0.21
7,MSK-4,-48.87,-123.4,837,MSK-4,1932-01-14,837,roe,sal,22.5
