# ![](https://ga-dash.s3.amazonaws.com/production/assets/logo-9f88ae6c9c3871690e33280fcf557f33.png) Intro to Pandas I

Week 2 | Day 1

### LEARNING OBJECTIVES
*After this lesson, you will be able to:*
- Describe the purpose of the pandas library
- Explain the fundamental data structures of pandas
- Read in a csv file using pandas
- Utilize the following methods: head, index, columns, max, min, unique, nunique, describe
- Select and slice data
- Use basic histogram plotting in pandas

## What is pandas?

![](http://i.imgur.com/OKffmnL.png)

pandas is a Python package providing **fast, flexible, and expressive data structures** designed to make working with “relational” or “labeled” data both easy and intuitive. [I]t has the broader **goal of becoming the most powerful and flexible open source data analysis / manipulation tool available in any language**. It is already well on its way toward this goal.

## Let's talk about those data structures

The two basic data structures are:
    
- Series
- DataFrame

## Series

A Series is a one-dimensional data structure. It always has an index, and it can optionally have a name.

In [1]:
# pandas must be imported before pd can be used
import pandas as pd

my_series = pd.Series([10, 20, 30, 40, 50])
my_series

In [None]:
my_series = pd.Series([10, 20, 30, 40, 50], index=[2012, 2013, 2014, 2015, 2016])
my_series
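As a small sketch of what the index buys you, a Series can be looked up either by its index label or by its integer position. The values match the cell above; the name `revenue` is invented purely for illustration.

```python
import pandas as pd

# The name "revenue" is a made-up label for illustration only
s = pd.Series([10, 20, 30, 40, 50],
              index=[2012, 2013, 2014, 2015, 2016],
              name="revenue")

by_label = s.loc[2014]     # look up by index label
by_position = s.iloc[0]    # look up by integer position

print(by_label, by_position, s.name)
```

Using `.loc` and `.iloc` explicitly avoids ambiguity when the index labels are themselves integers, as they are here.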

## DataFrame

A DataFrame is a two-dimensional data structure. It also has an index, and each column (itself a Series) in the DataFrame has a column name. A DataFrame can be thought of as being similar to a spreadsheet in structure.

<img src="http://i.imgur.com/Z5PAHRQ.png" width=400>
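To make the "columns are Series" idea concrete, here is a minimal sketch that builds a DataFrame from a dictionary. The data is a toy example made up for illustration, not part of the salary dataset used later.

```python
import pandas as pd

# Toy data, invented for illustration
toy = pd.DataFrame({
    "name": ["Ann", "Bob", "Cy"],
    "age": [31, 45, 28],
})

age_col = toy["age"]             # pulling out one column gives a Series

print(type(toy).__name__)        # the whole table
print(type(age_col).__name__)    # a single column
print(toy.shape)                 # (number of rows, number of columns)
print(age_col.name)              # the Series keeps its column name
```

Because no index was supplied, pandas assigns the default integer index 0, 1, 2.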

## Let's import pandas and read in a csv

[Salary Dataset Description](https://vincentarelbundock.github.io/Rdatasets/doc/car/Salaries.html)

In [None]:
# how we import pandas
import pandas as pd

# how to read in a csv
df = pd.read_csv('https://vincentarelbundock.github.io/Rdatasets/csv/car/Salaries.csv')
df

## Let's fix that existing index from the csv

In [None]:
# usecols selects which columns are read in;
# skipping column 0 drops the row numbers that were saved in the csv

df = pd.read_csv('https://vincentarelbundock.github.io/Rdatasets/csv/car/Salaries.csv',
                 usecols=range(1, 7))
df

## Changing the headers on read_csv

In [None]:
tdf = pd.read_csv('https://vincentarelbundock.github.io/Rdatasets/csv/datasets/HairEyeColor.csv',\
                  usecols=range(1,4))
tdf

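## Changing the headers on read_csv: header=None

In [None]:
# header=None tells pandas the file has no header row,
# so the original header line is read in as the first row of data
tdf = pd.read_csv('https://vincentarelbundock.github.io/Rdatasets/csv/datasets/HairEyeColor.csv',
                  usecols=range(1,4), header=None)
tdf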
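The effect of the `header` parameter is easier to see on a tiny in-memory file. The sketch below uses `io.StringIO` and a made-up three-line csv (an assumption for illustration, not the HairEyeColor data) so that it runs without a network connection. The replacement names `h`, `e`, and `n` are hypothetical.

```python
import io
import pandas as pd

# A tiny made-up csv: one header line followed by two data rows
csv_text = "hair,eye,freq\nBlack,Brown,32\nBrown,Blue,84\n"

# Default: the first line becomes the column names
with_header = pd.read_csv(io.StringIO(csv_text))

# header=None: the first line is treated as data, columns are numbered
no_header = pd.read_csv(io.StringIO(csv_text), header=None)

# header=0 plus names: discard the existing header and supply new names
renamed = pd.read_csv(io.StringIO(csv_text), header=0, names=["h", "e", "n"])

print(with_header.shape, no_header.shape)
print(list(no_header.columns))
print(list(renamed.columns))
```

Note that with `header=None` the old header text ends up in row 0, which is why that frame has one more row than the others.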

## Accessing and renaming the columns

In [None]:
df.columns

In [None]:
## Rename our columns
# str.replace leaves names without a '.' unchanged, so no condition is needed

df.columns = [x.replace('.', '_') for x in df.columns]
df.columns
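The list comprehension above is one way to rename columns. As a sketch of two alternatives, the example below uses the vectorized `.str.replace` on the column index and `DataFrame.rename` for a single column. The toy frame and the new name `gender` are assumptions made for illustration.

```python
import pandas as pd

# Toy frame with dotted column names, invented for illustration
toy = pd.DataFrame({
    "yrs.since.phd": [19, 20],
    "yrs.service": [18, 16],
    "sex": ["Male", "Male"],
})

# Replace every '.' with '_' across all column names at once
toy.columns = toy.columns.str.replace(".", "_", regex=False)

# Rename one specific column with a mapping of old name -> new name
toy = toy.rename(columns={"sex": "gender"})

print(list(toy.columns))
```

`regex=False` makes the `.` a literal character rather than a regular-expression wildcard.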

## Accessing and replacing the index

In [None]:
df.index

In [None]:
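# assigning a list of the same length replaces the index
# (here the values stay the same, but the RangeIndex becomes an explicit integer index)
df.index = [x for x in df.index]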

In [None]:
df.index

## Exercise

- Read in a csv from the following page: [Data CSVs](https://vincentarelbundock.github.io/Rdatasets/datasets.html)
- Change the columns using the usecols parameter
- Select only the even columns with a list comprehension
- Change the header parameter to remove the existing header
- Try using the skiprows parameter to exclude the first few rows
- Try renaming the columns

## Going back to our salary dataset

## Let's add a couple more imports as well

In [None]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

## Let's see the head

In [None]:
df.head()

## And now the tail

In [None]:
df.tail()

## We can pass arguments to those

In [None]:
df.head(20)

## Let's slice it up

In [None]:
# a slice inside single brackets selects rows by position (here rows 0, 1 and 2)
df[0:3]

## What if I want to select an individual column?

In [None]:
df['rank']

## Why did that look different?

In [None]:
type(df)

In [None]:
type(df['rank'])

## Double brackets to the rescue

In [None]:
df[['rank']]

## And we can now see the type

In [None]:
type(df[['rank']])

## What if I want more than one column?

In [None]:
df.columns

In [None]:
list_of_cols_i_want = ['rank', 'discipline', 'sex']

df[list_of_cols_i_want]
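The bracket rules can be summarized in one short sketch: a single column name returns a Series, while a list of names (even a list of one) returns a DataFrame. The toy frame below is made up so that the example runs on its own.

```python
import pandas as pd

# Toy frame, invented for illustration
toy = pd.DataFrame({
    "rank": ["Prof", "AsstProf"],
    "sex": ["Male", "Female"],
    "salary": [139750, 79750],
})

as_series = toy["rank"]              # string inside brackets -> Series
as_frame = toy[["rank"]]             # list inside brackets -> DataFrame
two_cols = toy[["rank", "salary"]]   # longer list -> DataFrame with those columns

print(type(as_series).__name__, as_series.shape)
print(type(as_frame).__name__, as_frame.shape)
print(type(two_cols).__name__, two_cols.shape)
```

The "double brackets" are therefore not special syntax: the outer pair does the selecting and the inner pair is an ordinary Python list.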

## What if I want to pick my row and columns together?

In [None]:
# .iloc[rows to select, columns to select] -- both are integer positions
df.iloc[0:10,2:4]

## What if I just want one cell?

In [None]:
# the comma separates the row indexer from the column indexer
df.iloc[5,4]
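A quick sketch of how `.iloc` slices behave, on a toy frame invented for illustration. It also shows `.loc`, the label-based counterpart that this lesson does not otherwise cover, only to highlight one difference: `.iloc` slices exclude the stop position, while `.loc` slices include the stop label.

```python
import pandas as pd

# Toy frame with string row labels, invented for illustration
toy = pd.DataFrame(
    {"a": [1, 2, 3, 4], "b": [10, 20, 30, 40], "c": [100, 200, 300, 400]},
    index=["w", "x", "y", "z"],
)

# Positions: rows 0-1 and columns 1-2 (the stop value is excluded)
block = toy.iloc[0:2, 1:3]

# One cell: third row, first column
cell = toy.iloc[2, 0]

# Labels: the same block, but here both endpoints are included
same_block = toy.loc["w":"x", "b":"c"]

print(block)
print(cell)
print(block.equals(same_block))
```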

## Exercise

Use the dataset you previously read in to do the following:
- Select the first and third columns two different ways 
- Using .iloc, select the second column's second data point
- Using .iloc, select the first 10 rows of the 2nd through the 4th columns
- Select the 'discipline' column as a Series and then as a DataFrame
- Use single bracket notation to select the first 8 rows

## Let's look at some summary statistics

In [None]:
df.describe()

## Let's customize that...

In [None]:
df.describe(percentiles=[.10,.20,.50,.80,.95])
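If only one percentile is needed, `Series.quantile` returns it directly instead of the whole summary table. The sketch below uses a made-up Series of the integers 1 through 101 so the results are easy to check by eye.

```python
import pandas as pd

# Made-up data: the integers 1..101
s = pd.Series(range(1, 102))

# describe() with custom percentiles; the median (50%) is always included
summary = s.describe(percentiles=[.10, .90])

# quantile() takes a fraction between 0 and 1 and returns a single number
median = s.quantile(0.5)
p99 = s.quantile(0.99)

print(summary)
print(median, p99)
```

Note the difference in units: `describe` labels its rows as percentages such as `10%`, while `quantile` expects the fraction `0.10`.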

## Let's get some data type info

In [None]:
df.info()

## Finding the max and min values

In [None]:
df['salary'].max()

In [None]:
df['salary'].min()

## Finding the uniques

In [None]:
df['rank'].unique()

In [None]:
df['rank'].nunique()
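One detail worth knowing is how the two methods treat missing values. In this sketch (toy data, invented for illustration), `unique` includes the missing value in its result, while `nunique` ignores it unless `dropna=False` is passed.

```python
import pandas as pd

# Toy Series with one repeated value and one missing value
ranks = pd.Series(["Prof", "AsstProf", "Prof", None])

values = ranks.unique()                        # array of distinct values, missing included
n_distinct = ranks.nunique()                   # count of distinct values, missing excluded
n_with_missing = ranks.nunique(dropna=False)   # count again, this time including missing

print(values)
print(n_distinct, n_with_missing)
```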

## Exercise

Using the professors' salary data:
- Find the max and min years since PhD
- Find the unique values and the number of unique values for years of service
- Find the 99th percentile of years since PhD
- Find the 99th percentile of salary

## Let's plot a histogram

In [None]:
df['salary'].hist()

## Let's customize that

In [None]:
df['salary'].hist(figsize=(10,8), color='k')

## And even more customization...

In [None]:
# set the style before plotting so that it applies to this chart
plt.style.use('fivethirtyeight')

# Series.hist returns the matplotlib Axes it drew on
ax = df['yrs_since_phd'].hist(figsize=(12,6), color='darkorange')

ax.set_ylabel('Frequency', fontsize=14)
ax.set_xlabel('Years', fontsize=14)
ax.set_title('A Chart of Stuff', y=1.01)
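Another common customization is the number of bins. The following is a minimal sketch on made-up salary figures (not the real dataset). It selects the non-interactive `Agg` backend so that it also runs outside a notebook. Inside Jupyter, that line can be dropped.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside Jupyter
import matplotlib.pyplot as plt
import pandas as pd

# Made-up salaries in thousands, for illustration only
salaries = pd.Series([80, 90, 95, 100, 110, 120, 150])

# bins controls how many bars the value range is divided into
ax = salaries.hist(bins=5, figsize=(8, 4), color="darkorange")

ax.set_xlabel("Salary")
ax.set_ylabel("Frequency")
ax.set_title("Salary distribution (toy data)")

print(len(ax.patches))  # one rectangle per bin
```

Fewer bins give a smoother but coarser picture, while more bins show detail at the cost of noisier bars.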

## Independent Exercise

Pick any dataset that interests you from the following page: [datasets](https://vincentarelbundock.github.io/Rdatasets/datasets.html)

- Read in the data with no parameter changes
- Modify which columns are read in using the header and usecols parameters
- Rename the headers
- Examine the data using head and tail
- What are the data types of each column?
- Try to change the data types (see pandas documentation)
- Examine the data to find the number of unique values
- What are the unique values?
- Select a column and plot a histogram of it
- Change the x and y labels and the title of the plot

## Conclusion:

In this lecture we've learned:
- What pandas is
- Why pandas is used
- What a DataFrame and a Series are
- How to index and slice in pandas
- How to view the head and tail and get descriptive stats
- How to view and change the columns and index
- The basics of plotting with pandas