In [None]:
import pandas as pd
import glob

# Load CSV

We'll start our journey by loading CSV (Comma Separated Values) files.

These might look like this:
```
Title, Date, Sales
The Matrix, 2020-01-04, 44553
Contagion, 2020-04-14, 120000
Idiocracy, 2016-11-06, 303421
```
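
As a minimal sketch, the sample above can be read straight from a string, without creating a file first. The sample has a space after each comma, so `skipinitialspace=True` is passed to keep those spaces out of the column names and values. The variable names `sample` and `movies_df` are made up for this illustration.

```python
import io

import pandas as pd

# The sample rows from above, held in a string instead of a file
sample = """Title, Date, Sales
The Matrix, 2020-01-04, 44553
Contagion, 2020-04-14, 120000
Idiocracy, 2016-11-06, 303421
"""

# skipinitialspace=True drops the blank that follows each comma
movies_df = pd.read_csv(io.StringIO(sample), skipinitialspace=True)

print(list(movies_df.columns))
print(movies_df["Sales"].sum())
```

Without `skipinitialspace=True`, the second and third column names would begin with a space, which makes them awkward to select later.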


### Headers

CSV files sometimes include a header row of column names to make it easier to understand what data you're looking at.

We'll get started by reading Street Sweeping Route data from the City of Los Angeles.

https://data.lacity.org/A-Livable-and-Sustainable-City/Posted-Street-Sweeping-Routes/krk7-ayq2

In [None]:
street_df = pd.read_csv('street_sweeping.csv')
street_df

In [None]:
street_df.describe()

In [None]:
street_df.columns

In [None]:
# .loc slices by label and includes both endpoints, so this returns rows 0 through 4
street_df.loc[0:4, ['Route No', 'Boundaries']]

### No Headers
Now let's assume we received a file without a header.

For this we'll use an altered version of the Los Angeles Animal Services Intake data file.

https://data.lacity.org/A-Well-Run-City/Animal-Services-Intake-Data/8cmr-fbcu

In [None]:
animal_serv_df = pd.read_csv('animal_services.csv')

In [None]:
animal_serv_df.head()

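Well, that didn't go as we planned. By default, pandas treats the first row of the file as the header, so our first record was used as the column names. We need to tell pandas that the file has no header and set the column names ourselves.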
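
To make the problem concrete, here is a small sketch on two made-up rows (they are not taken from the Animal Services file). It compares the default behaviour with `header=None`; the column labels passed to `names` are also only illustrative.

```python
import io

import pandas as pd

# Two made-up, header-less rows used only for illustration
raw = "EAST VALLEY,A100,DOG\nHARBOR,A200,CAT\n"

# Default behaviour: the first data row is consumed as the header
wrong_df = pd.read_csv(io.StringIO(raw))
print(list(wrong_df.columns))  # the first record ends up here
print(len(wrong_df))           # one row fewer than the file contains

# header=None keeps every row as data; names= supplies the labels
right_df = pd.read_csv(io.StringIO(raw),
                       header=None,
                       names=["shelter", "animal_id", "animal_type"])
print(len(right_df))
```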

In [None]:
col_names = ['Shelter',
             'Animal ID#',
             'Intake Date',
             'Intake Type',
             'Intake Condition',
             'Animal Type',
             'Group',
             'Breed 1',
             'Breed 2']

animal_serv_df = pd.read_csv('animal_services.csv', 
                             header = None, 
                             names = col_names)

In [None]:
animal_serv_df.head()

#### Changing Column Names
We can also change column names after reading in a CSV.

In [None]:
new_col_names = ['shelter_area',
                 'animal_id',
                 'intake_date',
                 'intake_type',
                 'intake_condition',
                 'animal_type',
                 'animal_group',
                 'breed_info',
                 'breed_info_aux']

animal_serv_df.columns = new_col_names

animal_serv_df.head()
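
Assigning to `.columns` replaces every name at once, so the list has to match the column order exactly. When only a few names need to change, `DataFrame.rename` with a mapping may be a safer option. The sketch below uses a tiny made-up frame (`demo_df`) rather than the real data.

```python
import pandas as pd

# Tiny made-up frame that borrows a few of the original column names
demo_df = pd.DataFrame({"Animal ID#": ["A100"],
                        "Intake Date": ["2020-01-04"],
                        "Group": ["X"]})

# Only the columns listed in the mapping are renamed; the rest are untouched
renamed_df = demo_df.rename(columns={"Animal ID#": "animal_id",
                                     "Intake Date": "intake_date"})

print(list(renamed_df.columns))
```

`rename` returns a new dataframe and leaves `demo_df` unchanged, so the result has to be assigned to a variable.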

### Non-Comma Separated

Even though they are called 'Comma Separated', the fields may be separated by other characters as well, such as tabs, semicolons, or pipes.

Now let's read a CSV that doesn't use commas. We'll first attempt to read it without changing anything and see what happens.

In [None]:
parking_df = pd.read_csv('parking_meter.csv')
parking_df.head()

As you can see, pandas doesn't know how to split up the data into columns. It found no commas, so it put each entire line into a single column.
Let's try again by specifying a *delimiter*.

In [None]:
parking_df = pd.read_csv('parking_meter.csv', 
                         sep = '|')
parking_df.head()
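
The same effect can be reproduced on a few made-up pipe-separated lines, which makes the difference in shape easy to check. The field names and values below are invented for the illustration and are not from `parking_meter.csv`.

```python
import io

import pandas as pd

# Made-up pipe-separated data
raw = "meter_id|zone|rate\nM1|Downtown|2.00\nM2|Harbor|1.00\n"

# Default separator (comma): every line lands in one column
one_col_df = pd.read_csv(io.StringIO(raw))
print(one_col_df.shape)

# Explicit separator: the line is split into three columns
piped_df = pd.read_csv(io.StringIO(raw), sep="|")
print(piped_df.shape)
```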

### Handling Blanks

In [None]:
parking_missing_df = pd.read_csv('parking_meter_missing.csv', 
                                 sep = '|')
parking_missing_df
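
When a field is empty, pandas reads it as `NaN`. A quick way to see how many values are missing in each column is `isna().sum()`. The sketch below uses made-up rows; note that a placeholder such as `UNKNOWN` is an ordinary string and is not counted as missing by default.

```python
import io

import pandas as pd

# Made-up rows: one empty zone, one empty rate, one 'UNKNOWN' placeholder
raw = "meter_id|zone|rate\nM1||2.00\nM2|Harbor|\nM3|UNKNOWN|1.50\n"

blank_df = pd.read_csv(io.StringIO(raw), sep="|")

# Count the missing values in each column
missing_per_column = blank_df.isna().sum()
print(missing_per_column)
```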

### Marking Data as NA/NaN

Let's mark the 'UNKNOWN' values as NaN.

In [None]:
# `sep` is used here for consistency with the earlier cells; `delimiter` is an alias for it
parking_missing_df = pd.read_csv('parking_meter_missing.csv', 
                                 sep = '|', 
                                 na_values = 'UNKNOWN')
parking_missing_df.head()
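
`na_values` also accepts a dictionary, which lets each column have its own placeholders. This is a sketch on made-up data; treating `-1` as "missing" in the `rate` column is an assumption made purely for the example.

```python
import io

import pandas as pd

# Made-up rows with a different placeholder in each column
raw = "meter_id|zone|rate\nM1|UNKNOWN|2.00\nM2|Harbor|-1\nM3|Downtown|1.50\n"

# Per-column placeholders: 'UNKNOWN' only counts in zone, '-1' only in rate
marked_df = pd.read_csv(io.StringIO(raw),
                        sep="|",
                        na_values={"zone": ["UNKNOWN"], "rate": ["-1"]})

print(marked_df.isna().sum())
```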

# Load Excel

We can load Excel files in much the same way, using `pd.read_excel`. Unlike a CSV file, an Excel workbook can contain multiple sheets, so we need to say which sheets we want to read.

### Single Sheet
We'll start by reading an Excel workbook with a single sheet.

In [None]:
lax_parking_df = pd.read_excel('lax_parking.xlsx')
lax_parking_df.head(3)

### Multisheet Workbook

Now we'll work with a workbook with multiple sheets.

Let's see what sheets are in this workbook.

In [None]:
xlsx = pd.ExcelFile('events_and_streets.xlsx')
print(xlsx.sheet_names)

#### Load a Single Sheet
First we'll read in a single sheet.



In [None]:
city_events_df = pd.read_excel(xlsx, sheet_name = 'city_events')
city_events_df.head(3)

#### Load All Sheets into Separate DataFrames in a Dictionary

In [None]:
data = {}
with pd.ExcelFile('events_and_streets.xlsx') as xl:
    for sheet in xl.sheet_names:
        data[sheet] = pd.read_excel(xl, sheet_name = sheet)
        
data.keys()

In [None]:
data['street_names'].head()
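
A shorter alternative to the loop is `sheet_name=None`, which makes `pd.read_excel` return a dictionary with one dataframe per sheet. The sketch below first writes a small throwaway workbook so it can run on its own; the file name and contents are made up, and it assumes an Excel engine such as `openpyxl` is installed.

```python
import os
import tempfile

import pandas as pd

# Build a small two-sheet workbook so the example is self-contained.
# The file name and the values are made up for illustration.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "demo_workbook.xlsx")
with pd.ExcelWriter(path) as writer:
    pd.DataFrame({"event": ["parade", "market"]}).to_excel(
        writer, sheet_name="city_events", index=False)
    pd.DataFrame({"street": ["Main St"]}).to_excel(
        writer, sheet_name="street_names", index=False)

# sheet_name=None returns a dict of {sheet name: DataFrame}
sheets = pd.read_excel(path, sheet_name=None)
print(list(sheets.keys()))
```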

#### Load All Sheets into the Same DataFrame

*You are given the annual data for Hexacorp, with each quarter in a separate sheet. All sheets have the same columns, and you want to load them into a single dataframe. You also want to add a new field called 'quarter'. (Luckily for us, the sheet names are the quarters.)*

In [None]:
data = {}
with pd.ExcelFile('hexacorp_2019.xlsx') as xl:
    for sheet in xl.sheet_names:
        data[sheet] = pd.read_excel(xl, sheet_name = sheet)
        data[sheet]['quarter'] = sheet

hex_2019_df = pd.concat(data)
hex_2019_df.shape

In [None]:
hex_2019_df.head()

In [None]:
hex_2019_df.tail()
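
Because `data` is a dictionary, `pd.concat(data)` uses its keys as an outer index level, so the result has a two-level index. The sketch below shows this on two made-up frames, along with two possible ways to flatten it: `ignore_index=True`, or naming the levels and moving the key level into a column. The level names `quarter` and `row` are chosen only for the example.

```python
import pandas as pd

# Two made-up frames keyed by quarter
frames = {
    "Q1": pd.DataFrame({"sales": [10, 20]}),
    "Q2": pd.DataFrame({"sales": [30]}),
}

# Dictionary keys become the outer level of a two-level index
keyed_df = pd.concat(frames)
print(keyed_df.index.nlevels)

# ignore_index=True discards the keys and renumbers the rows from 0
flat_df = pd.concat(frames, ignore_index=True)
print(list(flat_df.index))

# Naming the levels lets the key level be moved into a regular column
labelled_df = (pd.concat(frames, names=["quarter", "row"])
                 .reset_index(level="quarter"))
print(list(labelled_df["quarter"]))
```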

### Load All the Things

Now we'll load every sheet from every Excel workbook in the current folder whose name starts with `hexacorp`.

In [None]:
all_data = {}

for file in glob.glob("hexacorp*.xlsx"):
    print(f'Reading file: {file}')
    with pd.ExcelFile(file) as xl:
        for sheet in xl.sheet_names:
            all_data[file+sheet] = pd.read_excel(xl, sheet_name = sheet)
            all_data[file+sheet]['source_file'] = file
            all_data[file+sheet]['quarter'] = sheet

hex_all_df = pd.concat(all_data, ignore_index=True)
hex_all_df.shape

In [None]:
hex_all_df.head()

In [None]:
hex_all_df.tail()
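
One detail worth knowing is that `glob` returns files in no guaranteed order, so sorting the matches keeps the combined rows in a predictable sequence. The sketch below demonstrates the pattern with `pathlib`. To stay free of Excel dependencies it uses small made-up CSV files in a temporary folder as stand-ins for the workbooks; the file names and values are assumptions for the example.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Made-up stand-in files in a temporary folder
tmp_dir = Path(tempfile.mkdtemp())
pd.DataFrame({"sales": [10, 20]}).to_csv(tmp_dir / "hexacorp_2018.csv", index=False)
pd.DataFrame({"sales": [30]}).to_csv(tmp_dir / "hexacorp_2019.csv", index=False)
(tmp_dir / "notes.txt").write_text("not data")  # ignored by the pattern

frames = []
# sorted() gives a stable, alphabetical file order
for path in sorted(tmp_dir.glob("hexacorp*.csv")):
    frame = pd.read_csv(path)
    frame["source_file"] = path.name  # remember where each row came from
    frames.append(frame)

combined_df = pd.concat(frames, ignore_index=True)
print(combined_df)
```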