# Read in Files

Let's reinforce what is possible when reading files into pandas dataframes:

In [1]:
# setup python
import pandas as pd
import numpy as np

# CSV files - Basics

We have seen that CSV files can be parsed easily, but we should explore the options available to us:

In [2]:
# read in a csv file from the url
url = "https://public.tableau.com/s/sites/default/files/media/titanic%20passenger%20list.csv"
titanic = pd.read_csv(url)

In [3]:
# what do we have
titanic.shape

(1309, 14)

In [4]:
# first few
titanic.head()

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
0,1,1,"Allen, Miss. Elisabeth Walton",female,29.0,0,0,24160,211.3375,B5,S,2.0,,"St Louis, MO"
1,1,1,"Allison, Master. Hudson Trevor",male,0.92,1,2,113781,151.55,C22 C26,S,11.0,,"Montreal, PQ / Chesterville, ON"
2,1,0,"Allison, Miss. Helen Loraine",female,2.0,1,2,113781,151.55,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"
3,1,0,"Allison, Mr. Hudson Joshua Creighton",male,30.0,1,2,113781,151.55,C22 C26,S,,135.0,"Montreal, PQ / Chesterville, ON"
4,1,0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0,1,2,113781,151.55,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"


In [5]:
# columns
titanic.columns

Index(['pclass', 'survived', 'name', 'sex', 'age', 'sibsp', 'parch', 'ticket',
       'fare', 'cabin', 'embarked', 'boat', 'body', 'home.dest'],
      dtype='object')

# CSV File - Keep certain columns

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html

^^ The exam will allow access to any/all resources except each other, so it is helpful to practice reading the documentation.

In [6]:
# use the same titanic file, but only keep name, age, ticket and fare on import
titanic_small = pd.read_csv(url, usecols = ["name", "age", "ticket", "fare"])
titanic_small.head()

Unnamed: 0,name,age,ticket,fare
0,"Allen, Miss. Elisabeth Walton",29.0,24160,211.3375
1,"Allison, Master. Hudson Trevor",0.92,113781,151.55
2,"Allison, Miss. Helen Loraine",2.0,113781,151.55
3,"Allison, Mr. Hudson Joshua Creighton",30.0,113781,151.55
4,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",25.0,113781,151.55
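
As a quick offline check of how usecols behaves, here is a minimal sketch. The small in-memory CSV and its values are made up for illustration (they are not the titanic data); the point is only that usecols accepts either column names or zero-based positions.

```python
import io
import pandas as pd

# Hypothetical in-memory CSV standing in for a real file or URL
csv_text = """name,age,ticket,fare,cabin
Alice,29,A100,211.34,B5
Bob,41,A101,151.55,C22
Cara,35,A102,26.00,
"""

# usecols accepts a list of column names...
by_name = pd.read_csv(io.StringIO(csv_text), usecols=["name", "fare"])

# ...or a list of zero-based column positions
by_position = pd.read_csv(io.StringIO(csv_text), usecols=[0, 3])

print(by_name.columns.tolist())
print(by_name.equals(by_position))
```

Both calls keep the same two columns, and the columns come back in file order regardless of the order listed in usecols.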


# CSV File (but any file) - Set the row labels

When we do not specify the row labels, we get the default labels 0 through n-1. The numeric value acts as both the positional index and the label for the rows. In some cases you may want to specify the row labels yourself, and we can set them when we read files into dataframes.

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html

In [0]:
# bring in the cars dataset
cars_url = "https://gist.githubusercontent.com/seankross/a412dfbd88b3db70b74b/raw/5f23f993cd87c283ce766e7ac6b329ee7cc2e1d1/mtcars.csv"
cars = pd.read_csv(cars_url)
cars.head()

In [0]:
## model is the first column, so we use 0 for index
cars2 = pd.read_csv(cars_url, index_col=0)
cars2.head()
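
To see what setting the row labels buys us, here is a small sketch on a made-up, in-memory slice of car data (the three rows are illustrative, not read from the cars URL). It assumes the label column is named model, and shows that index_col takes either a position or a column name.

```python
import io
import pandas as pd

# Hypothetical three-row sample; "model" is the column we want as row labels
csv_text = """model,mpg,cyl
Mazda RX4,21.0,6
Datsun 710,22.8,4
Valiant,18.1,6
"""

# index_col accepts the column position...
by_position = pd.read_csv(io.StringIO(csv_text), index_col=0)

# ...or the column name
by_name = pd.read_csv(io.StringIO(csv_text), index_col="model")

# rows can now be looked up by label with .loc
print(by_name.loc["Datsun 710", "mpg"])

# positional lookup with .iloc still works (second row, first column)
print(by_name.iloc[1, 0])
```

Once the labels are set, the model column is no longer one of the regular data columns, so by_name.columns only holds mpg and cyl.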

# Excel Files

We saw in the assignment that you can read in Excel files. It's simple with read_excel:



In [0]:
# bring in an excel dataset
pets_url = "https://public.tableau.com/s/sites/default/files/media/Resources/catsvdogs.xlsx"
pets = pd.read_excel(pets_url)
pets.head()

In [0]:
# shape of the dataset
pets.shape

In [0]:
# we could have set the states as the row names
pets.Location

In [0]:
# set the state as the row index
pets2 = pd.read_excel(pets_url, index_col=0)
pets2.head()

# Excel Files with Multiple sheets

Pandas can also handle workbooks with multiple sheets. We just need to use pd.ExcelFile to open the workbook, then parse the sheets we want.



In [0]:
# bring in an excel workbook
govt_url = "https://public.tableau.com/s/sites/default/files/media/Resources/113thCongress.xlsx"
congress = pd.ExcelFile(govt_url)

In [0]:
# what do we have?
congress

In [0]:
# there is probably something that lets us look at sheets, right?
congress.sheet_names

In [0]:
# "Parse" the sheet into a dataframe
bills = congress.parse('Bills')

In [0]:
type(bills)

In [0]:
bills.head()
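
As an alternative to ExcelFile, read_excel can also pick sheets directly through its sheet_name parameter. The sketch below builds a tiny two-sheet workbook in memory so it runs without a download; the sheet contents are made up, and it assumes the openpyxl package is installed (pandas needs it for .xlsx files).

```python
import io
import pandas as pd

# Build a hypothetical two-sheet workbook in memory (requires openpyxl)
buffer = io.BytesIO()
with pd.ExcelWriter(buffer, engine="openpyxl") as writer:
    pd.DataFrame({"bill": ["HR1", "HR2"], "votes": [10, 20]}).to_excel(
        writer, sheet_name="Bills", index=False
    )
    pd.DataFrame({"member": ["A", "B"]}).to_excel(
        writer, sheet_name="Members", index=False
    )
workbook_bytes = buffer.getvalue()

# read a single sheet by name
bills_only = pd.read_excel(
    io.BytesIO(workbook_bytes), sheet_name="Bills", engine="openpyxl"
)

# sheet_name=None reads every sheet into a dict of {sheet name: dataframe}
all_sheets = pd.read_excel(
    io.BytesIO(workbook_bytes), sheet_name=None, engine="openpyxl"
)

print(list(all_sheets))
print(all_sheets["Bills"].shape)
```

ExcelFile is handy when you want to inspect sheet_names first; sheet_name=None is a shortcut when you already know you want everything.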

# Tab-delimited files

We have already seen that pandas and read_csv can handle more than one kind of delimited file. Let's do one more common example that you might see: tab-delimited.

In [0]:
tab_url = "https://gist.githubusercontent.com/mbostock/3305937/raw/a5be7c5fd55c4fa0ca8a400cb68d658a40989966/data.tsv"
# with the default comma delimiter, each row is parsed as a single column
tab_df = pd.read_csv(tab_url)
tab_df.head()

In [0]:
# just like in the assignment, we can set the delimiter; here it's a \t
tab_df2 = pd.read_csv(tab_url, delimiter="\t")
tab_df2.head()
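
The difference between the two reads is easy to verify on a tiny example. This sketch uses a made-up, in-memory tab-delimited string (not the file at the URL) and also shows pd.read_table, which uses a tab as its default separator.

```python
import io
import pandas as pd

# Hypothetical tab-delimited text: two columns, two rows
tsv_text = "letter\tfrequency\nA\t0.08167\nB\t0.01492\n"

# default delimiter is a comma, so each line lands in one column
wrong = pd.read_csv(io.StringIO(tsv_text))

# sep (or its alias, delimiter) tells read_csv to split on tabs
right = pd.read_csv(io.StringIO(tsv_text), sep="\t")

# read_table splits on tabs by default
also_right = pd.read_table(io.StringIO(tsv_text))

print(wrong.shape)   # (2, 1)
print(right.shape)   # (2, 2)
print(right.equals(also_right))
```

A one-column result where you expected several is usually the first sign that the delimiter is wrong.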