# Introduction to Pandas

In this workbook, you will learn [Pandas](http://pandas.pydata.org/), a library built around the `DataFrame`, a convenient data structure for storing and manipulating tabular data.

In [1]:
import pandas as pd
pd.set_option("display.max_rows", 15)

## Reading in Data

When you use Pandas to read in data from a CSV, it automatically infers the types of the columns.

In [2]:
data = pd.read_csv("/data/harris.csv")
data

Unnamed: 0,Bsal,Sal77,Sex,Senior,Age,Educ,Exper
0,5040,12420,Male,96,329,15,14.0
1,6300,12060,Male,82,357,15,72.0
2,6000,15120,Male,67,315,15,35.5
3,6000,16320,Male,97,354,12,24.0
4,6000,12300,Male,66,351,12,56.0
5,6840,10380,Male,92,374,15,41.5
6,8100,13980,Male,66,369,16,54.5
...,...,...,...,...,...,...,...
86,5100,10560,Female,84,458,12,36.0
87,4800,9240,Female,84,571,16,214.0
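
To see type inference in action without the original file, here is a minimal sketch that reads a small CSV from a string. The three-row sample and the name `csv_text` are assumptions made for illustration; `io.StringIO` simply stands in for a file on disk.

```python
import io

import pandas as pd

# Hypothetical three-row sample shaped like the table above (not the real file).
csv_text = """Bsal,Sal77,Sex,Exper
5040,12420,Male,14.0
6300,12060,Male,72.0
5100,8940,Female,165.0
"""

sample = pd.read_csv(io.StringIO(csv_text))

# .dtypes lists the inferred type of each column: whole numbers become
# integers, decimals become floats, and text gets a non-numeric dtype.
print(sample.dtypes)
```

Checking `.dtypes` right after reading a file is a quick way to spot a numeric column that was accidentally read as text.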


## Basic Indexing and Selection

Let's check that Pandas inferred the types reasonably. You can access columns either using dictionary-style notation...

In [59]:
data["Bsal"]

0     5040
1     6300
2     6000
3     6000
4     6000
5     6840
6     8100
      ... 
86    5100
87    4800
88    6000
89    4380
90    5580
91    4620
92    5220
Name: Bsal, dtype: int64

...or attribute-style notation.

In [4]:
data.Sex

0       Male
1       Male
2       Male
3       Male
4       Male
5       Male
6       Male
       ...  
86    Female
87    Female
88    Female
89    Female
90    Female
91    Female
92    Female
Name: Sex, dtype: object
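
The two notations are not always interchangeable. The sketch below uses a hypothetical frame (the column names `base salary` and `count` are invented for illustration) to show two cases where only dictionary-style notation reaches the column.

```python
import pandas as pd

# Hypothetical frame: one column name contains a space, the other
# shares its name with the DataFrame method `count`.
df = pd.DataFrame({"base salary": [5040, 6300], "count": [1, 2]})

# Dictionary-style notation works for any column label.
print(df["base salary"])
print(df["count"])

# Attribute-style notation cannot express a name with a space, and
# df.count resolves to the method rather than the column.
print(callable(df.count))
```

When in doubt, dictionary-style notation is the safer choice; attribute-style notation is mainly a convenience for simple column names.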

What happens if you index by a slice (e.g., `2:5`)?

In [6]:
data[2:5]

Unnamed: 0,Bsal,Sal77,Sex,Senior,Age,Educ,Exper
2,6000,15120,Male,67,315,15,35.5
3,6000,16320,Male,97,354,12,24.0
4,6000,12300,Male,66,351,12,56.0


What happens if you index by a Boolean mask?

In [25]:
data[data.Sex == "Female"]

Unnamed: 0,Bsal,Sal77,Sex,Senior,Age,Educ,Exper
14,5100,8940,Female,95,640,15,165.0
15,4800,8580,Female,98,774,12,381.0
16,5280,8760,Female,98,557,8,190.0
17,5280,8040,Female,88,745,8,90.0
18,4800,9000,Female,77,505,12,63.0
19,4800,8820,Female,76,482,12,6.0
20,5400,13320,Female,86,329,15,24.0
...,...,...,...,...,...,...,...
86,5100,10560,Female,84,458,12,36.0
87,4800,9240,Female,84,571,16,214.0
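
Boolean masks can also be combined. The following is a small sketch on a hypothetical four-row frame (not the Harris data) that keeps only the rows satisfying two conditions at once.

```python
import pandas as pd

# Hypothetical toy data for illustration.
df = pd.DataFrame({
    "Sex": ["Male", "Female", "Female", "Male"],
    "Educ": [15, 12, 16, 12],
})

# Each comparison yields a Boolean Series. Wrap each one in parentheses,
# because & and | bind more tightly than == and >=.
mask = (df.Sex == "Female") & (df.Educ >= 15)

# Indexing by the mask keeps only the rows where it is True.
selected = df[mask]
print(selected)
```

Use `&` for "and", `|` for "or", and `~` for "not"; the Python keywords `and`, `or`, and `not` do not work element-wise on a Series.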


What happens if you try to get the second row by `data[1]`? Can you think of a reason why this doesn't work?
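
One way to investigate is to try it on a toy frame. In the sketch below (the frame `df` is a hypothetical stand-in for `data`), a bare scalar inside square brackets is treated as a column label, so Pandas looks for a column named `1` and fails.

```python
import pandas as pd

# Hypothetical stand-in for the real data.
df = pd.DataFrame({"Bsal": [5040, 6300, 6000], "Sex": ["Male", "Male", "Male"]})

# df[1] is interpreted as "the column labelled 1", which does not exist.
try:
    df[1]
    found = True
except KeyError:
    found = False
print(found)

# Positional row access needs an explicit indexer instead.
second_row = df.iloc[1]
print(second_row)
```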

## Pandas Indexers

To index specific rows, Pandas offers the `.loc` and `.iloc` indexers.

- `.loc` indexes by label.
- `.iloc` indexes by position.

By default, when you read in data, each row's index label _is_ just its position, so `data.loc[1]` and `data.iloc[1]` return the same row. However, we will see an example shortly where the index label is not just its position. In such cases, you will usually want to use `.loc`.

In [61]:
data.loc[1]

Bsal            6300
Sal77          12060
Sex             Male
Senior            82
Age              357
Educ              15
Exper             72
AgeInYears        29
SalIncrease     5760
Name: 1, dtype: object

In [28]:
data.iloc[1]

Bsal       6300
Sal77     12060
Sex        Male
Senior       82
Age         357
Educ         15
Exper        72
Name: 1, dtype: object
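
The two indexers diverge as soon as labels and positions stop matching, for instance after filtering. This is a minimal sketch on a hypothetical three-row frame; the values are for illustration only.

```python
import pandas as pd

# Hypothetical toy data for illustration.
df = pd.DataFrame({
    "Sex": ["Male", "Female", "Female"],
    "Bsal": [5040, 5100, 4800],
})

# Filtering keeps the original labels, so `women` has index labels 1 and 2.
women = df[df.Sex == "Female"]

by_label = women.loc[1]      # the row labelled 1 (the first remaining row)
by_position = women.iloc[1]  # the row at position 1 (the second remaining row)

print(by_label["Bsal"], by_position["Bsal"])
```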

You can pass in both row and column indexers to `.loc` or `.iloc`.

In [65]:
data.loc[1, "Bsal"]

6300

You can even pass in a slice of column labels! This is a quick way to get many adjacent columns at once without typing out each column name.

In [30]:
data.loc[:, "Senior":"Exper"]

Unnamed: 0,Senior,Age,Educ,Exper
0,96,329,15,14.0
1,82,357,15,72.0
2,67,315,15,35.5
3,97,354,12,24.0
4,66,351,12,56.0
5,92,374,15,41.5
6,66,369,16,54.5
...,...,...,...,...
86,84,458,12,36.0
87,84,571,16,214.0
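
One detail worth knowing: label slices with `.loc` include the end label, whereas position slices with `.iloc` follow the usual Python convention and exclude the end position. The sketch below illustrates this on a hypothetical two-row frame that borrows the column names used above.

```python
import pandas as pd

# Hypothetical two-row frame using the same column names as above.
df = pd.DataFrame({
    "Senior": [96, 82],
    "Age": [329, 357],
    "Educ": [15, 15],
    "Exper": [14.0, 72.0],
})

# Label slice: both endpoints are included, so Educ is part of the result.
by_label = df.loc[:, "Senior":"Educ"]

# Position slice: the end position is excluded, so only columns 0 and 1 remain.
by_position = df.iloc[:, 0:2]

print(list(by_label.columns))
print(list(by_position.columns))
```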


## Adding New Columns

We can add new columns using dictionary-style assignment. For example, the column `Age` is in months. Let's create a new column, `AgeInYears`, which converts `Age` to whole years.

In [39]:
data["AgeInYears"] = data["Age"] // 12
data

Unnamed: 0,Bsal,Sal77,Sex,Senior,Age,Educ,Exper,AgeInYears
0,5040,12420,Male,96,329,15,14.0,27
1,6300,12060,Male,82,357,15,72.0,29
2,6000,15120,Male,67,315,15,35.5,26
3,6000,16320,Male,97,354,12,24.0,29
4,6000,12300,Male,66,351,12,56.0,29
5,6840,10380,Male,92,374,15,41.5,31
6,8100,13980,Male,66,369,16,54.5,30
...,...,...,...,...,...,...,...,...
86,5100,10560,Female,84,458,12,36.0,38
87,4800,9240,Female,84,571,16,214.0,47
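
The `//` operator performs floor division, which is why `AgeInYears` contains whole numbers. As a small sketch, the first three ages from the table can be converted both ways; the column name `AgeExact` is hypothetical and used only for comparison.

```python
import pandas as pd

# The first three Age values (in months) from the table above.
df = pd.DataFrame({"Age": [329, 357, 315]})

# Floor division drops the fractional part and keeps an integer column.
df["AgeInYears"] = df["Age"] // 12

# True division keeps the fraction and produces a float column.
df["AgeExact"] = df["Age"] / 12

print(df)
```

Both operations are applied element-wise to the whole column, so no explicit loop is needed.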


## Summarizing Data

Pandas offers many functions for summarizing data. To quickly get an overview of the entire DataFrame, you can use the `.describe()` method.

In [40]:
data.describe()

Unnamed: 0,Bsal,Sal77,Senior,Age,Educ,Exper,AgeInYears
count,93.0,93.0,93.0,93.0,93.0,93.0,93.0
mean,5420.322581,10392.903226,82.27957,474.397849,12.505376,100.927419,39.129032
std,709.587222,1789.640831,10.254761,140.210489,2.282369,90.946985,11.684462
min,3900.0,7860.0,65.0,280.0,8.0,0.0,23.0
25%,4980.0,9000.0,74.0,349.0,12.0,35.5,29.0
50%,5400.0,10020.0,84.0,468.0,12.0,70.0,39.0
75%,6000.0,11220.0,90.0,590.0,15.0,144.0,49.0
max,8100.0,16320.0,98.0,774.0,16.0,381.0,64.0


What is the type of this object? How would you extract the mean salary in 1977?

In [42]:
# ENTER YOUR CODE HERE
data.describe().loc["mean", "Sal77"]

10392.903225806451
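
To make the type question concrete, here is a minimal sketch on a hypothetical three-row frame. It checks that the summary is itself a `DataFrame`, indexed by statistic name, and that looking up the mean in it agrees with computing the mean directly.

```python
import pandas as pd

# Hypothetical three-row frame for illustration.
df = pd.DataFrame({
    "Bsal": [5040, 6300, 6000],
    "Sal77": [12420, 12060, 15120],
})

summary = df.describe()

# The summary is a DataFrame whose row labels are the statistic names.
print(type(summary))
print(list(summary.index))

# So .loc works on it exactly as it does on the original data.
mean_from_summary = summary.loc["mean", "Sal77"]
mean_direct = df["Sal77"].mean()
print(mean_from_summary, mean_direct)
```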

Notice that the above description only included the quantitative variables. We can also apply the `.describe()` method to individual variables. Let's see what happens when we apply it to a categorical variable.

In [43]:
data["Sex"].describe()

count         93
unique         2
top       Female
freq          61
Name: Sex, dtype: object
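
The summary above reports only the most frequent category. If you want the count of every category, `.value_counts()` is one option. This is a small sketch on a hypothetical five-element Series, not the real `Sex` column.

```python
import pandas as pd

# Hypothetical categorical data for illustration.
sex = pd.Series(["Male", "Female", "Female", "Male", "Female"], name="Sex")

# Number of occurrences of each distinct value.
counts = sex.value_counts()

# The same information expressed as proportions that sum to 1.
shares = sex.value_counts(normalize=True)

print(counts)
print(shares)
```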

We can also calculate specific summary statistics directly.

In [44]:
data["Bsal"].mean(), data["Bsal"].std()

(5420.322580645161, 709.58722173207275)

## Exercises

**Exercise 1.** Calculate the average beginning salary (`Bsal`) for men. Then do the same for women. How do they compare? (For a full analysis, you should also look at the standard deviations.)

In [53]:
# ENTER YOUR CODE HERE
print("Male:", data.loc[data.Sex == "Male", "Bsal"].mean())
print("Female:", data.loc[data.Sex == "Female", "Bsal"].mean())

Male: 5956.875
Female: 5138.85245902


**Exercise 2.** Add a column called `SalIncrease` that represents how much each employee's salary increased between when they started and 1977.

In [57]:
# ENTER YOUR CODE HERE
data["SalIncrease"] = data.Sal77 - data.Bsal
data["SalIncrease"]

0      7380
1      5760
2      9120
3     10320
4      6300
5      3540
6      5880
      ...  
86     5460
87     4440
88     5940
89     5640
90     2280
91     4800
92     3120
Name: SalIncrease, dtype: int64

**Exercise 3.** `Educ` represents how many years of education an employee had. `Exper` represents how many months of experience an employee had _before_ coming to Harris Bank. `Senior` represents how many months an employee had been at the company.

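The sum of an employee's education, experience, and seniority should not exceed their age! Let's sanity check the data. Has anyone in the data set been studying or working longer than they've been alive? (`Educ` is in years, while `Exper`, `Senior`, and `Age` are in months, so convert `Educ` to months before adding.)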

In [80]:
# ENTER YOUR CODE HERE
# Convert Educ from years to months so that every term is in months
check = (data.Educ * 12 + data.Exper + data.Senior) > data.Age
check.sum()

1

In [81]:
data[check]

Unnamed: 0,Bsal,Sal77,Sex,Senior,Age,Educ,Exper,AgeInYears,SalIncrease
78,4440,9600,Female,97,341,15,75.0,28,5160
