## Learning objectives

* Explain the Pandas data structures of DataFrames and Series
* Perform basic manipulations of data structures in Pandas.
	* Creating new columns
	* Selecting rows based on conditions
* Practice reading in files from CSVs and dealing with common issues that arise.

## Pandas Introduction and Uses

### What is Pandas?

__Overview:__
- __[Pandas](http://pandas.pydata.org/pandas-docs/stable/index.html):__ Pandas is a Python package that provides fast and flexible data structures that are designed to make working with [relational](https://en.wikipedia.org/wiki/Relational_database) or "labeled" data easy and intuitive
- Pandas was created in 2008 by [Wes McKinney](https://en.wikipedia.org/wiki/Wes_McKinney), who published [this](http://www.dlr.de/sc/Portaldata/15/Resources/dokumente/pyhpc2011/submissions/pyhpc2011_submission_9.pdf) paper at PyHPC in 2011 describing the usefulness of and need for Pandas (the name is a play on [__Pan__ el __Da__ ta](https://en.wikipedia.org/wiki/Panel_data) ). In his words:

_"Pandas enables people to analyze and work with data who are not expert computer scientists...the code is intuitive and accessible. Pandas helps people move beyond just using Excel for data analysis"_ 

- In Python's early years, tasks such as importing CSV files, working with spreadsheet-like datasets of rows and columns, and merging tables were difficult 
- Pandas was developed to solve these problems. With the introduction of the DataFrame, Pandas made intuitive analysis and exploration of tabular data practical in Python 
- Python is one of the most popular programming languages for Data Scientists largely because of packages such as Pandas, Numpy, and Matplotlib. In recent years, the Pandas Package has become a staple in the Data Scientist's toolbox for the following reasons:
> 1. As a Data Scientist, it is common to work with tabular data in which each column can hold a different type of data, known as __heterogeneously-typed data__ (similar to a SQL table or Excel spreadsheet). The Pandas DataFrame represents tabular data and lets you do what you would do in a spreadsheet, but programmatically and faster 
> 2. As a Data Scientist, it is common to work with time series data that may be ordered or unordered and Pandas has extensive capabilities to treat dates, times, etc. 
> 3. The most time-consuming part of any Data Scientist's job is __[Data Munging](https://en.wikipedia.org/wiki/Data_wrangling)__ (Data Cleaning/Wrangling) and Pandas provides all the necessary tools at your fingertips to do this quicker and cleaner 
> 4. Exploratory Data Analysis is often overlooked in Data Science, but remains one of the most important tasks of a Data Scientist and Pandas provides many easy and intuitive methods to perform data manipulation 

- Now you understand why Data Scientists use the Pandas Package, but which features of the package allow us, as users, to realize these benefits? 
> 1. __Missing Data:__ Pandas handles missing data well (represented as `NaN`, a floating-point value from Numpy)
> 2. __Size Mutability:__ Pandas DataFrames are size mutable which means columns and rows can be inserted and deleted 
> 3. __Data Alignment:__ Pandas allows you to align an object to a specific set of labels OR lets Pandas align the data for you
> 4. __Grouping Data:__ Pandas `groupby` and `pivot_table` functionality allows both aggregating and transforming of data 
> 5. __Data Access:__ Pandas has extensive capabilities of slicing, indexing and subsetting large data sets 
> 6. __Reshaping Data:__ Pandas has extensive capabilities of merging, joining, and reshaping data 
> 7. __Input/Output:__ Pandas allows easy import and export of flat files such as CSV 
> 8. __Time-Series:__ Pandas has specific Time-Series functionality to work with dates
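
To make the missing-data and alignment points concrete, here is a minimal sketch. The two Series and their fruit labels are hypothetical and not part of this lesson's data.

```python
import pandas as pd

# Two Series with partially overlapping labels (hypothetical data)
q1_sales = pd.Series([10, 20, 30], index=["apples", "pears", "plums"])
q2_sales = pd.Series([1, 2, 3], index=["pears", "plums", "kiwis"])

# Addition aligns on labels, not on position; a label present in only
# one of the two Series produces NaN (missing) in the result
total = q1_sales + q2_sales
print(total)

# Missing values can then be handled explicitly:
# drop them, or treat an absent label as zero while adding
print(total.dropna())
filled_total = q1_sales.add(q2_sales, fill_value=0)
print(filled_total)
```

Note that nothing had to be sorted or matched by hand: the labels in each index drive the alignment.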

### Pandas Data Structures 

__Overview:__ 
- Recall that the usefulness of Pandas has to do with its fundamental data structures
- There are 2 types of data structures in Pandas:
> 1. [`Series`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.html#pandas.Series): Series is a one-dimensional labeled array that is capable of holding any data type (e.g. `int`, `str`, `float`, etc.), with a single data type (`dtype`) for the whole Series. The axis labels of a Series are referred to as the __Index__ of the Series
> 2. [`DataFrame`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html): DataFrame is a two-dimensional labeled data structure with columns of potentially different types. It largely resembles a spreadsheet or SQL table. The first axis labels of a DataFrame (rows) are referred to as the __Index__ of the DataFrame, whereas the second axis labels of a DataFrame (columns) are referred to as the __Columns__ of the DataFrame

![](img/dataframe.png)
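
As a small sketch of the two structures described above, the block below builds a Series and a DataFrame by hand. The values mirror the employee data used later in this lesson; the variable names are only illustrative.

```python
import pandas as pd

labels = ["Employee_1", "Employee_2", "Employee_3"]

# A Series: one-dimensional, one dtype, labeled by its Index
ages = pd.Series([50, 25, 31], index=labels, name="Age")
print(ages.index)   # the row labels
print(ages.dtype)   # a single dtype for the whole Series

# A DataFrame: two-dimensional, each column may have its own dtype
people = pd.DataFrame(
    {"First": ["Maya", "Jonathan", "Jerod"], "Age": [50, 25, 31]},
    index=labels,
)
print(people.index)    # row labels (the Index)
print(people.columns)  # column labels (the Columns)
print(people.dtypes)   # one dtype per column
```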

In [1]:
# ignore warnings
import warnings
warnings.filterwarnings('ignore')

# imports
import pandas as pd
import numpy as np 

## Creating DataFrames

There are several ways to create DataFrames. The most common way you'll work with is reading from a CSV or Excel file:

### Reading from CSV

Let's look at the CSV file (we can even use Excel for this!)

```
,First,Last,Age
Employee_1,Maya,Midzik,50
Employee_2,Jonathan,Balaban,25
Employee_3,Jerod,Rubalcava,31
```

In [2]:
df = pd.read_csv("resources/csv_file_for_pandas.csv") 
df

Unnamed: 0.1,Unnamed: 0,First,Last,Age
0,Employee_1,Maya,Midzik,50
1,Employee_2,Jonathan,Balaban,25
2,Employee_3,Jerod,Rubalcava,31


That "Unnamed: 0" column isn't what we want - we want _that_ to be our index!

#### Mini-exercise

Use "?" to look up the documentation of the Pandas `read_csv` function and see which argument could help us deal with that "Unnamed: 0" column.

In [3]:
pd.read_csv?

In [24]:
df = pd.read_csv("resources/csv_file_for_pandas.csv",index_col = 0)

In [25]:
df

Unnamed: 0,First,Last,Age
Employee_1,Maya,Midzik,50
Employee_2,Jonathan,Balaban,25
Employee_3,Jerod,Rubalcava,31
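
If you do not have the CSV file at hand, the effect of `index_col` can be reproduced with an in-memory copy of the same text. This is a sketch that assumes the file contents shown above; `io.StringIO` stands in for the file on disk.

```python
import io
import pandas as pd

# Same text as the CSV file shown above, held in memory
csv_text = """,First,Last,Age
Employee_1,Maya,Midzik,50
Employee_2,Jonathan,Balaban,25
Employee_3,Jerod,Rubalcava,31
"""

# Without index_col, the unnamed first column becomes a regular
# column called "Unnamed: 0" and the index is 0, 1, 2
default_df = pd.read_csv(io.StringIO(csv_text))
print(default_df.columns.tolist())

# With index_col=0, the first column becomes the row index
indexed_df = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(indexed_df.columns.tolist())
print(indexed_df.index.tolist())
```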


### Columns = `Series`

Individual DataFrame columns are Pandas `Series`:

In [26]:
df['First']

Employee_1        Maya
Employee_2    Jonathan
Employee_3       Jerod
Name: First, dtype: object

In [8]:
type(df['First'])

pandas.core.series.Series

You might notice that this syntax is similar to dictionary syntax. That is because a DataFrame behaves much like a dictionary that maps column names to `Series`.

In fact, you can initialize DataFrames using dictionary syntax:

### DataFrame from dictionary

In [27]:
# Each dictionary value can be a different type (ndarray, list, Series)
my_dict = {"ndarray":np.arange(4),
           "List":[10,12,1,2],
           "Series":pd.Series('a', index = ["row_1", "row_2", "row_3", "row_4"])}
my_dict

{'ndarray': array([0, 1, 2, 3]), 'List': [10, 12, 1, 2], 'Series': row_1    a
 row_2    a
 row_3    a
 row_4    a
 dtype: object}

In [28]:
pd.DataFrame(my_dict)

Unnamed: 0,ndarray,List,Series
row_1,0,10,a
row_2,1,12,a
row_3,2,1,a
row_4,3,2,a


Now let's inspect DataFrames in more detail:

### "Inspection functions" for DataFrames

In [29]:
my_df = pd.read_csv("resources/csv_file_for_pandas.csv", index_col = 0)
my_df

Unnamed: 0,First,Last,Age
Employee_1,Maya,Midzik,50
Employee_2,Jonathan,Balaban,25
Employee_3,Jerod,Rubalcava,31


### Dimensions:

In [30]:
len(my_df)

3

In [31]:
my_df.shape

(3, 3)

In [32]:
my_df.size

9

### Row Labels:

In [33]:
my_df.index

Index(['Employee_1', 'Employee_2', 'Employee_3'], dtype='object')

In [34]:
my_df.index.tolist()

['Employee_1', 'Employee_2', 'Employee_3']

### Column Labels:

In [35]:
print(my_df.columns)
print("\ndata type: ",type(my_df.columns))

Index(['First', 'Last', 'Age'], dtype='object')

data type:  <class 'pandas.core.indexes.base.Index'>


In [36]:
print(my_df.columns.values)
print("\ndata type: ",type(my_df.columns.values))

['First' 'Last' 'Age']

data type:  <class 'numpy.ndarray'>


In [37]:
my_df.columns.values.tolist()

['First', 'Last', 'Age']

In [38]:
print(my_df.columns.tolist())

print("\ndata type: ",type(my_df.columns.tolist()))

['First', 'Last', 'Age']

data type:  <class 'list'>


Note: It is useful to remember that what you can do with a Python object depends on its data type. The `Index`, `ndarray`, and `list` versions of the column names above behave differently:

In [39]:
my_df

Unnamed: 0,First,Last,Age
Employee_1,Maya,Midzik,50
Employee_2,Jonathan,Balaban,25
Employee_3,Jerod,Rubalcava,31


In [40]:
# For example, the pandas index is immutable, so you cannot directly replace an item
my_df.columns[0] = 'first_old'

TypeError: Index does not support mutable operations

In [41]:
# Writing through the values attribute changes the underlying array in place.
# It works here, but it bypasses the Index's immutability, so prefer rename()
my_df.columns.values[0] = 'first_old'
#my_df.columns.values[1] = ['a','b']

In [42]:
my_df

Unnamed: 0,first_old,Last,Age
Employee_1,Maya,Midzik,50
Employee_2,Jonathan,Balaban,25
Employee_3,Jerod,Rubalcava,31


In [43]:
# Converting to a list gives you a copy: changing the list does not
# change the DataFrame's columns, but it is a useful way to work with
# the column names using everything the list data structure offers
col_list = my_df.columns.tolist()
col_list[1] = 'last_old'
print(col_list)

['first_old', 'last_old', 'Age']


In [44]:
my_df.head()

Unnamed: 0,first_old,Last,Age
Employee_1,Maya,Midzik,50
Employee_2,Jonathan,Balaban,25
Employee_3,Jerod,Rubalcava,31


In [45]:
my_df.rename(columns = {'first_old':'First'},inplace=True)
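
The cells above can be summarized in a short sketch: assigning to one element of the `Index` fails, while `rename` or replacing the whole `columns` attribute works. The one-row DataFrame and the new column names are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"First": ["Maya"], "Last": ["Midzik"], "Age": [50]})

# An Index is immutable: item assignment raises TypeError
try:
    df.columns[0] = "first_name"
except TypeError as err:
    print("Index is immutable:", err)

# rename() returns a new DataFrame with only the mapped columns renamed
renamed = df.rename(columns={"First": "first_name"})
print(renamed.columns.tolist())

# Replacing the whole columns attribute is also allowed
df.columns = [name.lower() for name in df.columns]
print(df.columns.tolist())
```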

### Data Types:

In [46]:
my_df.dtypes

First    object
Last     object
Age       int64
dtype: object

### Data Quick Look:

In [47]:
my_df.head(2)

Unnamed: 0,First,Last,Age
Employee_1,Maya,Midzik,50
Employee_2,Jonathan,Balaban,25


In [48]:
my_df.tail(1)

Unnamed: 0,First,Last,Age
Employee_3,Jerod,Rubalcava,31


In [49]:
my_df.sample(2)

Unnamed: 0,First,Last,Age
Employee_2,Jonathan,Balaban,25
Employee_3,Jerod,Rubalcava,31


### Data Summary:

By default, `describe()` summarizes only the numeric columns.

In [50]:
my_df.describe()

Unnamed: 0,Age
count,3.0
mean,35.333333
std,13.051181
min,25.0
25%,28.0
50%,31.0
75%,40.5
max,50.0
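
To include the text columns in the summary as well, `describe` accepts an `include` argument. The sketch below rebuilds the employee table inline so that it runs on its own.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "First": ["Maya", "Jonathan", "Jerod"],
        "Last": ["Midzik", "Balaban", "Rubalcava"],
        "Age": [50, 25, 31],
    }
)

# include="all" adds count/unique/top/freq rows for non-numeric columns;
# statistics that do not apply to a column are shown as NaN
full_summary = df.describe(include="all")
print(full_summary)
```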


### Info:

In [51]:
my_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 3 entries, Employee_1 to Employee_3
Data columns (total 3 columns):
First    3 non-null object
Last     3 non-null object
Age      3 non-null int64
dtypes: int64(1), object(2)
memory usage: 96.0+ bytes


### Mini-exercise

Read in the data in the `population_data.csv` CSV file into a DataFrame called *prob2_df*. Print out the:
    
* Number of rows and columns
* Column Names
* A sample of 2 rows
* A summary of the values

In [52]:
prob2_df = pd.read_csv("resources/population_data.csv")


In [53]:
prob2_df.head(3)

Unnamed: 0,Country Name,Population,Size
0,China,1409517397,9572900
1,India,1339180127,3287263
2,USA,324459463,9629091


In [None]:
prob2_df.shape

In [None]:
prob2_df.columns


In [55]:
prob2_df.sample(2)


Unnamed: 0,Country Name,Population,Size
3,Indonesia,263991379,1904556
0,China,1409517397,9572900


In [56]:

prob2_df.describe()

Unnamed: 0,Population,Size
count,5.0,5.0
mean,709287300.0,6581155.0
std,608988300.0,3697593.0
min,209288300.0,1904556.0
25%,263991400.0,3287263.0
50%,324459500.0,8511965.0
75%,1339180000.0,9572900.0
max,1409517000.0,9629091.0


### Creating new columns

We create new columns in a DataFrame similarly to how we create new entries in a dictionary:

In [57]:
my_df

Unnamed: 0,First,Last,Age
Employee_1,Maya,Midzik,50
Employee_2,Jonathan,Balaban,25
Employee_3,Jerod,Rubalcava,31


In [58]:
my_df['Salary_000s'] = [80, 70, 60]
my_df

Unnamed: 0,First,Last,Age,Salary_000s
Employee_1,Maya,Midzik,50,80
Employee_2,Jonathan,Balaban,25,70
Employee_3,Jerod,Rubalcava,31,60


We can also create new columns as functions of individual DataFrame columns:

In [59]:
my_df['Salary'] = my_df['Salary_000s'] * 1000
my_df

Unnamed: 0,First,Last,Age,Salary_000s,Salary
Employee_1,Maya,Midzik,50,80,80000
Employee_2,Jonathan,Balaban,25,70,70000
Employee_3,Jerod,Rubalcava,31,60,60000


Side note: we can easily create columns as functions of other columns because "under the hood" Pandas is using Numpy arrays to store its data.

#### Creating columns as functions of multiple columns

Just as you can with Excel formulas, you can create columns in Pandas DataFrames that are functions of multiple columns of the original DataFrame, rather than just one. The way to do it is to:

1. Define a function that operates on a "row"
2. Apply that function to every row in the DataFrame

#### Example:

In [60]:
def young_and_rich(row):
    if row['Age'] < 28 and row['Salary_000s'] > 65:
        return True
    else:
        return False


In [61]:

my_df['young_and_rich'] = my_df.apply(young_and_rich, axis=1)    

In [62]:
my_df['age_squared'] = my_df['Age'].apply(lambda x: x**2 if x > 30 else x)

In [63]:
my_df

Unnamed: 0,First,Last,Age,Salary_000s,Salary,young_and_rich,age_squared
Employee_1,Maya,Midzik,50,80,80000,False,2500
Employee_2,Jonathan,Balaban,25,70,70000,True,25
Employee_3,Jerod,Rubalcava,31,60,60000,False,961


For an explanation of `apply` see
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html
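
Row-wise `apply` is flexible, but simple conditions like the ones above can usually be written directly on whole columns. The sketch below compares the two approaches on an inline copy of the data; using `np.where` here is one possible alternative, not the only one.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"Age": [50, 25, 31], "Salary_000s": [80, 70, 60]},
    index=["Employee_1", "Employee_2", "Employee_3"],
)

# Row-wise apply, as in the cells above
via_apply = df.apply(
    lambda row: row["Age"] < 28 and row["Salary_000s"] > 65, axis=1
)

# Column-wise (vectorized) equivalent: combine Boolean Series with &
via_vector = (df["Age"] < 28) & (df["Salary_000s"] > 65)

# np.where(condition, value_if_true, value_if_false) replaces the
# lambda with the inline if/else
age_squared = np.where(df["Age"] > 30, df["Age"] ** 2, df["Age"])

print(via_vector)
print(age_squared)
```

The vectorized forms tend to be faster on large DataFrames because they avoid calling a Python function once per row.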

### Aside on Python functions: `lambda`, `map`, `filter`, `zip`

__Overview:__
- __[Lambda Expressions](https://docs.python.org/3/tutorial/controlflow.html#lambda-expressions):__ Lambda Expressions are small, anonymous functions that can be created with the `lambda` keyword  
- Lambda Expressions are just another tool for building functions. In summary, we can build a function in Python using one of the following methods:
> 1. `def` keyword as we saw in last lecture
> 2. `lambda` keyword as explained here
- Functions built using Lambda Expressions have a few dominating characteristics:
> 1. __Anonymous:__ Lambda Functions are __[Anonymous Functions](https://en.wikipedia.org/wiki/Anonymous_function)__ which means they do not require a name to be used immediately and can be developed without a proper definition (like we needed with `def` earlier)
> 2. __Ad-Hoc:__ Lambda Functions are used in an "ad-hoc" fashion. This means that we only create the function when we need it, use it immediately, and never use it again
> 3. __Short:__ Lambda Functions support only minimal input, requiring that the body of the function is short
- Lambda Expressions are NOT mandatory and we can definitely do without them, but in some scenarios they can be useful for writing cleaner and more efficient code 

__Helpful Points:__
1. Lambdas are restrictive because their body must be a single __[expression](https://docs.python.org/3/reference/expressions.html)__. An expression "represents" a value, such as a number or a string, whereas a __[statement](https://docs.python.org/2/reference/simple_stmts.html)__ "does" something, such as assigning a value to a variable
2. Lambda Functions can also take multiple inputs (just like traditional functions) in the following way: `lambda input_1, input_2: <expression>`

__Practice:__ Examples of Lambda Expressions in Python

In [64]:
# Example of lambda
special = lambda x, y : x + y**2 
# Named `result` rather than `sum` to avoid shadowing the built-in sum()
result = special(2, 4)
print(result)



18
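
A typical "ad-hoc" use of a lambda is as the `key` argument of `sorted`. The sketch below uses a hypothetical list of `(name, age)` tuples and shows that a `def` function does the same job.

```python
# Hypothetical (name, age) records
employees = [("Maya", 50), ("Jonathan", 25), ("Jerod", 31)]

# The lambda is created, used once as the sort key, and discarded
by_age = sorted(employees, key=lambda person: person[1])
print(by_age)

# Equivalent version with a named function built using def
def get_age(person):
    return person[1]

by_age_def = sorted(employees, key=get_age)
print(by_age_def)
```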


__Overview:__
- __[Map Function](https://docs.python.org/2/library/functions.html#map):__ Map is a built-in Python function that applies a function to every item of an `iterable` (e.g. a sequence such as a `list` or `str`). In Python 3 it returns a lazy iterator over the results, so wrap it in `list()` to get a list 
- The general form of the `map()` function is the following: `map(function, iterable, ...)`
- Map functions make it easier to apply a function to every element of a sequence than writing a `for` loop that applies the function on every iteration (see the example below) 

__Helpful Points:__
1. The `map()` function can take more than one `iterable`, as long as the `function` accepts that many arguments
2. In Python 3 the `function` argument must be callable: passing `None` (which Python 2 treated as the __[Identity Function](https://en.wikipedia.org/wiki/Identity_function)__) results in a `TypeError` 
3. Map Functions are commonly used in conjunction with Lambda Expressions

__Practice:__ Example of Map Functions in Python 

In [65]:
# Example of map
# Apply function to each element in list
square = map(lambda x : x**2, [1, 2, 3, 4]) 
print(list(square))

# Example with two inputs
mult = map(lambda x, y : x*y, [1, 2, 3, 4],[1,0.5,1,0.5]) 
print(list(mult))



[1, 4, 9, 16]
[1, 1.0, 3, 2.0]
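
One detail worth seeing in action: in Python 3, `map` produces an iterator that can be consumed only once. The short sketch below also shows the equivalent list comprehension, which many Python programmers prefer for simple cases.

```python
squares = map(lambda x: x**2, [1, 2, 3, 4])

first_pass = list(squares)   # consumes the iterator
second_pass = list(squares)  # nothing is left the second time

print(first_pass)
print(second_pass)

# The same result with a list comprehension
comprehension = [x**2 for x in [1, 2, 3, 4]]
print(comprehension)
```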


__Overview:__
- __[Filter Function](https://docs.python.org/2/library/functions.html#filter):__ Filter is a built-in function that keeps only the elements of the `iterable` argument for which the `function` returns `True` (it filters out all the elements of the `iterable` that evaluate as `False`). In Python 3 it returns a lazy iterator, so wrap it in `list()` to get a list  
- The general form of the `filter()` function is the following: `filter(function, iterable)`
- The `function` used in the first argument should return a Boolean value (`True` or `False`); any other return value is interpreted by its truthiness 
- Similar to Map functions, Filter functions save you from writing a `for` loop that tests every element of a sequence (see the example below) 

__Helpful Points:__
1. If the `function` argument is `None`, the __[Identity Function](https://en.wikipedia.org/wiki/Identity_function)__ is assumed: each element is tested by its own truthiness, so falsy elements (such as `0`, `''`, and `None`) are removed and the rest are returned unchanged 
2. Remember, Filter Functions are commonly used in conjunction with Lambda Expressions

__Practice:__ Example of Filter Functions in Python 

In [66]:
# Example of filter
# Output results for which condition is true
evens = filter(lambda x : x % 2 == 0, [1, 2, 3, 4, 5, 6]) 
print(list(evens))



[2, 4, 6]
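
The behavior of `filter` with `None` as the function can be checked with a small sketch. The list of mixed values is made up for illustration.

```python
# A mix of truthy and falsy values (hypothetical data)
values = [0, 1, "", "a", None, 2.5, [], [0]]

# With None as the function, elements are kept if they are truthy
truthy = list(filter(None, values))
print(truthy)

# The earlier even-number example written as a list comprehension
evens = [x for x in [1, 2, 3, 4, 5, 6] if x % 2 == 0]
print(evens)
```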


__Overview:__
- __[Zipping](http://python-reference.readthedocs.io/en/latest/docs/functions/zip.html)__: Zipping is a convenient feature in Python that allows you to combine 2 or more sequences into a single sequence
- The new sequence consists of `n-tuples`, where `n` is the number of sequences passed in and the i-th tuple contains the i-th element from each of the argument sequences
- For example, 2 objects of type `list` can be "zipped" together, and the result (once converted to a `list`) is a list of tuples looking like this: `[(element 0 of list 1, element 0 of list 2), (element 1 of list 1, element 1 of list 2), ...]`
- __Unzipping:__ Unzipping is the opposite of __Zipping__ and is performed by passing the zipped pairs back to `zip()` with the `*` operator 

__Helpful Points:__
1. Despite the name, this has nothing to do with __["zip"](https://en.wikipedia.org/wiki/Zip_(file_format))__ files, where a series of files is compressed. In Python nothing is compressed: like the teeth of a zipper, `zip()` pairs up elements from each sequence
2. If the sequences that are passed in are not of equal length, the result is truncated to the length of the shortest sequence
3. The `zip()` function returns an iterator rather than a `list`; to get a list, wrap the call in `list()`
4. Zipping is very useful (and common) for iterating over multiple sequences at once (see the third example below)

__Practice:__ Examples of Zipping and Unzipping in Python 

In [67]:
# Example of zip
# Pair elements in list
list_1 = [1, 2, 3, 10]
list_2 = [3, 5, 7, 9]
pairs = list(zip(list_1, list_2))
print(pairs)

[(1, 3), (2, 5), (3, 7), (10, 9)]


In [68]:
# Example of unzip: zip in reverse
first_list, second_list =  zip(*pairs)
print('First_list =', first_list)
print('Second_list =', second_list)

First_list = (1, 2, 3, 10)
Second_list = (3, 5, 7, 9)


In [69]:
# More advanced example---putting it all together
# Find maximum in each pair of numbers
list(map(lambda pair: max(pair), zip(list_1, list_2)))

[3, 5, 7, 10]
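
The truncation behavior and the "iterate over several sequences at once" use case can be sketched as follows. The name and age lists are hypothetical, and `itertools.zip_longest` is shown as one standard-library option when truncation is not wanted.

```python
from itertools import zip_longest

names = ["Maya", "Jonathan", "Jerod"]
ages = [50, 25]  # deliberately one element short

# zip stops at the shortest input
truncated = list(zip(names, ages))
print(truncated)

# zip_longest pads the shorter input with a fill value instead
padded = list(zip_longest(names, ages, fillvalue=None))
print(padded)

# Iterating over two sequences in parallel
descriptions = []
for name, age in zip(names, [50, 25, 31]):
    descriptions.append(f"{name} is {age}")
print(descriptions)
```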

### Boolean Indexing - DataFrames:

Boolean indexing lets you select a subset of rows from a DataFrame based on some condition. It relies on the fact that a comparison on a Pandas Series is evaluated element-wise, resulting in a Series of Booleans:

In [70]:
my_df

Unnamed: 0,First,Last,Age,Salary_000s,Salary,young_and_rich,age_squared
Employee_1,Maya,Midzik,50,80,80000,False,2500
Employee_2,Jonathan,Balaban,25,70,70000,True,25
Employee_3,Jerod,Rubalcava,31,60,60000,False,961


In [71]:
my_df["Age"] > 30

Employee_1     True
Employee_2    False
Employee_3     True
Name: Age, dtype: bool

Then, you can use that to subset:

In [72]:
my_df[my_df["Age"] > 30] # all rows with age greater than 30

Unnamed: 0,First,Last,Age,Salary_000s,Salary,young_and_rich,age_squared
Employee_1,Maya,Midzik,50,80,80000,False,2500
Employee_3,Jerod,Rubalcava,31,60,60000,False,961


In [73]:
(my_df["Age"] > 30) & (my_df["Age"] < 50)

Employee_1    False
Employee_2    False
Employee_3     True
Name: Age, dtype: bool
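
Two details of the combined condition above are easy to get wrong: the element-wise operators are `&`, `|` and `~` (not `and`, `or`, `not`), and each comparison needs its own parentheses because `&` binds more tightly than `>` or `<`. A minimal sketch on an inline copy of the ages:

```python
import pandas as pd

ages = pd.Series(
    [50, 25, 31], index=["Employee_1", "Employee_2", "Employee_3"], name="Age"
)

# Element-wise "and": parentheses around each comparison are required
mask = (ages > 30) & (ages < 50)
print(mask)

# Python's `and` asks for a single True/False for the whole Series,
# which pandas refuses to guess
raised = False
try:
    (ages > 30) and (ages < 50)
except ValueError as err:
    raised = True
    print("ValueError:", err)
```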

In [74]:
my_df

Unnamed: 0,First,Last,Age,Salary_000s,Salary,young_and_rich,age_squared
Employee_1,Maya,Midzik,50,80,80000,False,2500
Employee_2,Jonathan,Balaban,25,70,70000,True,25
Employee_3,Jerod,Rubalcava,31,60,60000,False,961


In [75]:
# all rows with age greater than 49 or salary equal to 70000
my_df[(my_df["Age"] > 49) | (my_df["Salary"] ==70000)] 

Unnamed: 0,First,Last,Age,Salary_000s,Salary,young_and_rich,age_squared
Employee_1,Maya,Midzik,50,80,80000,False,2500
Employee_2,Jonathan,Balaban,25,70,70000,True,25


In [76]:
my_df

Unnamed: 0,First,Last,Age,Salary_000s,Salary,young_and_rich,age_squared
Employee_1,Maya,Midzik,50,80,80000,False,2500
Employee_2,Jonathan,Balaban,25,70,70000,True,25
Employee_3,Jerod,Rubalcava,31,60,60000,False,961


In [77]:
# all rows where first name is not equal to Jerod 
my_df[~(my_df["First"] == "Jerod")] 

Unnamed: 0,First,Last,Age,Salary_000s,Salary,young_and_rich,age_squared
Employee_1,Maya,Midzik,50,80,80000,False,2500
Employee_2,Jonathan,Balaban,25,70,70000,True,25


In [78]:
# all rows where first name is not equal to Jerod 
my_df[my_df["First"] != "Jerod"] 

Unnamed: 0,First,Last,Age,Salary_000s,Salary,young_and_rich,age_squared
Employee_1,Maya,Midzik,50,80,80000,False,2500
Employee_2,Jonathan,Balaban,25,70,70000,True,25


### Exercise

Using the same data from the population dataframe below:

* Create a new column "density", that is the population divided by the size
* Filter the DataFrame to only include rows with population less than 1000000000.
* Sort the countries from highest to lowest density
    * For this last part: you'll have to Google how to do this. You'll get used to this quickly, and it isn't as bad as you might think. Try Googling "sort pandas dataframe" and I bet the first result you get will be a link to the documentation that you can use to show you how to do this. Once you realize you can "just Google stuff", a whole world opens up for you!

In [79]:
prob2_df

Unnamed: 0,Country Name,Population,Size
0,China,1409517397,9572900
1,India,1339180127,3287263
2,USA,324459463,9629091
3,Indonesia,263991379,1904556
4,Brazil,209288278,8511965


In [80]:
prob2_df['Density'] = prob2_df['Population'] / prob2_df['Size']
prob2_df

Unnamed: 0,Country Name,Population,Size,Density
0,China,1409517397,9572900,147.240376
1,India,1339180127,3287263,407.384541
2,USA,324459463,9629091,33.695752
3,Indonesia,263991379,1904556,138.610458
4,Brazil,209288278,8511965,24.58754


In [81]:
prob2_df[prob2_df['Population'] < 1000000000]

Unnamed: 0,Country Name,Population,Size,Density
2,USA,324459463,9629091,33.695752
3,Indonesia,263991379,1904556,138.610458
4,Brazil,209288278,8511965,24.58754


In [82]:
prob2_df.sort_values(by='Density',ascending=False,inplace=True)

In [83]:
prob2_df.reset_index(drop=True)

Unnamed: 0,Country Name,Population,Size,Density
0,India,1339180127,3287263,407.384541
1,China,1409517397,9572900,147.240376
2,Indonesia,263991379,1904556,138.610458
3,USA,324459463,9629091,33.695752
4,Brazil,209288278,8511965,24.58754


In [84]:
# Sort on full data (not filtered by population)
prob2_df.sort_values(by='Density', ascending=False)

Unnamed: 0,Country Name,Population,Size,Density
1,India,1339180127,3287263,407.384541
0,China,1409517397,9572900,147.240376
3,Indonesia,263991379,1904556,138.610458
2,USA,324459463,9629091,33.695752
4,Brazil,209288278,8511965,24.58754
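
The three steps of the exercise can also be chained into one expression that leaves the original DataFrame untouched. This is a sketch of one possible style, not the only solution; the data is copied inline from the table above, and `assign` plus a callable inside `.loc` are used so that each step sees the result of the previous one.

```python
import pandas as pd

population = pd.DataFrame(
    {
        "Country Name": ["China", "India", "USA", "Indonesia", "Brazil"],
        "Population": [1409517397, 1339180127, 324459463, 263991379, 209288278],
        "Size": [9572900, 3287263, 9629091, 1904556, 8511965],
    }
)

result = (
    population
    .assign(Density=lambda d: d["Population"] / d["Size"])  # new column
    .loc[lambda d: d["Population"] < 1_000_000_000]         # filter rows
    .sort_values(by="Density", ascending=False)             # sort
    .reset_index(drop=True)                                 # renumber rows
)
print(result)
```

Because no step uses `inplace=True`, `population` still has its original three columns afterwards.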


### Select by label (loc) and position (iloc)


- __[Select by Label](http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-label) (loc):__ This is __purely label-based indexing__, which selects rows and columns by the labels in the index of the object. The syntax of this method (for DataFrames) is `df.loc[row_label, column_label]`; depending on the arguments, it returns a single value, a `Series`, or a `DataFrame`. The following provides a list of the possible arguments for the `row_label` and/or `column_label`:
>> a. A single label (e.g. `5` or `'a'`; note that the number `5` is interpreted as a label and NOT as an integer position, use `.iloc` for that)<br>
>> b. A list or array of labels (e.g. `['a', 'b', 'c']`)<br>
>> c. A slice object with labels (e.g. `'a':'f'`; unlike other slices in Python, the `stop` label is included in the slice)<br>
>> d. A boolean array 
- __[Select by Integer Location (Position)](http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-integer) (iloc):__ This is __purely integer-based indexing__, which requires integer positions (0-based) as input. The syntax of this method (for DataFrames) is `df.iloc[row_number, column_number]`; depending on the arguments, it returns a single value, a `Series`, or a `DataFrame`. The following provides a list of the possible arguments for `row_number` and/or `column_number`:
>> a. An integer (e.g. `5`)<br>
>> b. A list or array of integers (e.g. `[4, 3, 0]`)<br>
>> c. A slice object with integers (e.g. `1:7`; as with other Python slices, the `stop` position is excluded)<br>
>> d. A boolean array 
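
The difference between labels and positions is easiest to see when the index labels are themselves integers. The sketch below uses a hypothetical DataFrame whose row labels are 10, 20, 30 and 40.

```python
import pandas as pd

df = pd.DataFrame(
    {"letter": ["a", "b", "c", "d"], "number": [1, 2, 3, 4]},
    index=[10, 20, 30, 40],
)

# Same cell, addressed by label and by position
by_label = df.loc[20, "letter"]
by_position = df.iloc[1, 0]

# Label slices include the stop label; position slices exclude it
label_slice = df.loc[10:30]    # rows labeled 10, 20 and 30
position_slice = df.iloc[0:2]  # rows at positions 0 and 1

# The return type depends on what is selected
one_row = df.loc[20]           # a Series
two_rows = df.loc[[10, 20]]    # a DataFrame

# loc never falls back to positions: there is no row labeled 1
missing_label = False
try:
    df.loc[1]
except KeyError:
    missing_label = True

print(by_label, by_position, len(label_slice), len(position_slice))
```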

In [85]:
# Let's go ahead and make another copy of my_df called new_df
new_df = my_df.copy()
new_df.head()

Unnamed: 0,First,Last,Age,Salary_000s,Salary,young_and_rich,age_squared
Employee_1,Maya,Midzik,50,80,80000,False,2500
Employee_2,Jonathan,Balaban,25,70,70000,True,25
Employee_3,Jerod,Rubalcava,31,60,60000,False,961


### loc

In [86]:
# loc method allows us to select rows and columns 
# by their label or boolean logic introduced above
# select a specific value using row and column names
new_df.loc['Employee_2','Last'] 

'Balaban'

In [87]:
#change values for specific location
new_df.loc['Employee_2','Last'] = 'Carlos' 
new_df

Unnamed: 0,First,Last,Age,Salary_000s,Salary,young_and_rich,age_squared
Employee_1,Maya,Midzik,50,80,80000,False,2500
Employee_2,Jonathan,Carlos,25,70,70000,True,25
Employee_3,Jerod,Rubalcava,31,60,60000,False,961


In [88]:
#select a subset of the rows and columns
new_df.loc['Employee_1':'Employee_3',['First','young_and_rich']] 

Unnamed: 0,First,young_and_rich
Employee_1,Maya,False
Employee_2,Jonathan,True
Employee_3,Jerod,False


In [89]:
# We can also use the subset method introduced 
# for Boolean operations within loc method
# get first name and age of all individuals over the age of 30 
new_df.loc[new_df['Age'] > 30,['First','Age']] 

Unnamed: 0,First,Age
Employee_1,Maya,50
Employee_3,Jerod,31


In [90]:
# Still have full dataframe
new_df

Unnamed: 0,First,Last,Age,Salary_000s,Salary,young_and_rich,age_squared
Employee_1,Maya,Midzik,50,80,80000,False,2500
Employee_2,Jonathan,Carlos,25,70,70000,True,25
Employee_3,Jerod,Rubalcava,31,60,60000,False,961


### iloc

In [91]:
# iloc method allows us to select rows and columns 
# by their integer location
new_df.iloc[1,0] # the value at row position 1 and column position 0

'Jonathan'

In [92]:
# similar to loc, you can also pull a subset of rows and columns
new_df.iloc[[0,2,1],0:2] 

Unnamed: 0,First,Last
Employee_1,Maya,Midzik
Employee_3,Jerod,Rubalcava
Employee_2,Jonathan,Carlos


In [93]:
# change values for specific locations
new_df.iloc[1,1] = 'Reif' 
new_df

Unnamed: 0,First,Last,Age,Salary_000s,Salary,young_and_rich,age_squared
Employee_1,Maya,Midzik,50,80,80000,False,2500
Employee_2,Jonathan,Reif,25,70,70000,True,25
Employee_3,Jerod,Rubalcava,31,60,60000,False,961


### Exercise
* Maya Midzik just had her birthday! Add one year to her age using loc or iloc
* Select all odd rows and even columns from the new_df dataframe
    * Hint: Think about how to use the step when slicing a list

In [94]:
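# Guard so that the year is added only once, even if the cell is re-run
if new_df.loc['Employee_1','Age'] < 51:
    new_df.loc['Employee_1','Age']+=1

In [95]:
# Unguarded version: adds a year on every run, so use it instead of
# (not in addition to) the guarded cell above
new_df.loc['Employee_1','Age']+=1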

In [96]:
new_df.iloc[1::2,::2]

Unnamed: 0,First,Age,Salary,age_squared
Employee_2,Jonathan,25,70000,25


### File I/O

Often you'll begin your Data Science workflow by reading in a CSV or Excel file. 

In [97]:
filepath = "resources/melbourne_temperature.csv"
temp_data = pd.read_csv(filepath)

In [98]:
temp_data.head()

Unnamed: 0,Date,"Daily minimum temperatures in Melbourne, Australia, 1981-1990"
0,1981-01-01,20.7
1,1981-01-02,17.9
2,1981-01-03,18.8
3,1981-01-04,14.6
4,1981-01-05,15.8


The second column name is long, so let's shorten it:

#### Renaming columns

In [99]:
temp_data.columns = ['Date', 'Min_Melbourne_Temp']
temp_data.head()

Unnamed: 0,Date,Min_Melbourne_Temp
0,1981-01-01,20.7
1,1981-01-02,17.9
2,1981-01-03,18.8
3,1981-01-04,14.6
4,1981-01-05,15.8


If we don't want to rename _all_ the columns in the DataFrame, we can write:

In [100]:
# Read in data
temp_data = pd.read_csv(filepath)

# Rename columns
temp_data = temp_data.rename(
    columns={"Daily minimum temperatures in Melbourne, Australia, 1981-1990": "Min_Melbourne_Temp"})

# Resulting data
temp_data.head()

Unnamed: 0,Date,Min_Melbourne_Temp
0,1981-01-01,20.7
1,1981-01-02,17.9
2,1981-01-03,18.8
3,1981-01-04,14.6
4,1981-01-05,15.8


A good thing to do to check that your data is "clean" is to print out the data types of the columns and see if they agree with what you expect:

In [101]:
temp_data.dtypes

Date                  object
Min_Melbourne_Temp    object
dtype: object

We notice that the temperature column, which "should" be a number, is being interpreted as the generic type "object". Something is going wrong:

In [102]:
temp_data['Min_Melbourne_Temp'].astype(float)

ValueError: could not convert string to float: '?0.2'

Error! Looks like there are some values with "?"s in them. 

In general, there are many ways to clean dirty data. For now, we'll ask you to define a function that identifies rows with question marks. 

### Cleanup

Create a new column in this DataFrame that identifies if a row has a question mark in it. There are a bunch of ways to do it - consider the methods for creating new columns that we've seen so far, and feel free to Google.

In [103]:
temp_data['has_question_mark'] = temp_data['Min_Melbourne_Temp'].map(lambda x: '?' in str(x))


Now use Pandas' **[Boolean Indexing](https://pandas.pydata.org/pandas-docs/stable/indexing.html#boolean-indexing)** (covered above, but linked for your reference) to select just the rows that do _not_ have a question mark.

In [104]:
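# .copy() makes the filtered rows an independent DataFrame, so adding
# columns to it later does not trigger a SettingWithCopyWarning
temp_data = temp_data[~temp_data['has_question_mark']].copy()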
print(len(temp_data))

3647


In [105]:
# Now you should be able to convert this column to float
temp_data['Min_Melbourne_Temp_new'] = temp_data['Min_Melbourne_Temp'].astype(float)

In [106]:
temp_data.dtypes

Date                       object
Min_Melbourne_Temp         object
has_question_mark            bool
Min_Melbourne_Temp_new    float64
dtype: object
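
As an alternative to flagging the bad rows with a lambda, Pandas has built-in helpers for this kind of cleanup. The sketch below uses a short, made-up Series that imitates the problem (including the `'?0.2'` value from the error message); `str.contains` and `pd.to_numeric` are existing Pandas functions, but applying them to this dataset is a suggestion rather than the lesson's prescribed solution.

```python
import pandas as pd

# A few values imitating the raw temperature column
raw = pd.Series(["20.7", "17.9", "?0.2", "14.6"], name="Min_Melbourne_Temp")

# Flag rows containing a literal question mark
# (regex=False because "?" is a special character in regular expressions)
has_question_mark = raw.str.contains("?", regex=False)
print(has_question_mark)

# Convert to numbers, turning anything unparseable into NaN
as_numbers = pd.to_numeric(raw, errors="coerce")
print(as_numbers)

# Drop the rows that could not be converted
clean = as_numbers.dropna()
print(clean.dtype, len(clean))
```

Whether to drop such rows or to repair them (for example by stripping the `?`) depends on what the question mark means in the source data, which the file itself does not say.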

Now we can do all the things that we do with numeric columns to this column, such as plotting the results:

In [107]:
temp_data['Min_Melbourne_Temp_new'].plot();

Oftentimes you'll get a CSV or Excel file with a date column. By default, Pandas will read such a column in as strings. `read_csv` has a `parse_dates` argument that converts it to datetimes instead:

In [108]:
temp_data_clean = pd.read_csv(filepath, parse_dates=[0])
temp_data_clean.dtypes

Date                                                             datetime64[ns]
Daily minimum temperatures in Melbourne, Australia, 1981-1990            object
dtype: object

In [109]:
temp_data_clean.head()

Unnamed: 0,Date,"Daily minimum temperatures in Melbourne, Australia, 1981-1990"
0,1981-01-01,20.7
1,1981-01-02,17.9
2,1981-01-03,18.8
3,1981-01-04,14.6
4,1981-01-05,15.8
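
To see the effect of `parse_dates` without the real file, here is a minimal sketch that reads a tiny hypothetical CSV from an in-memory string. The column name `Temp` and the variable names are assumptions made for illustration. It also shows `pd.to_datetime`, which converts a column after it has already been loaded:

```python
import io
import pandas as pd

# Hypothetical two-row CSV held in memory instead of on disk
csv_text = "Date,Temp\n1981-01-01,20.7\n1981-01-02,17.9\n"

# Without parse_dates the Date column is read as text
as_text = pd.read_csv(io.StringIO(csv_text))

# parse_dates accepts column positions or column names
as_dates = pd.read_csv(io.StringIO(csv_text), parse_dates=["Date"])

print(as_text["Date"].dtype)
print(as_dates["Date"].dtype)

# Alternative: convert the text column after loading
converted = pd.to_datetime(as_text["Date"])
print(converted.dtype)
```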


Great, but what have we gained by converting this column's type to "datetime"? Answer: now that this column is not merely a string, but is date-aware, we can do things like this:

In [112]:
temp_data_clean['month'] = [x.month for x in temp_data_clean['Date']]
temp_data_clean['day_of_week'] = [x.dayofweek for x in temp_data_clean['Date']]

In [113]:
# The same day-of-week column can also be built with .apply instead of a list comprehension
temp_data_clean['day_of_week'] = temp_data_clean['Date'].apply(lambda x: x.dayofweek)

In [114]:
temp_data_clean.head()

Unnamed: 0,Date,"Daily minimum temperatures in Melbourne, Australia, 1981-1990",month,day_of_week
0,1981-01-01,20.7,1,3
1,1981-01-02,17.9,1,4
2,1981-01-03,18.8,1,5
3,1981-01-04,14.6,1,6
4,1981-01-05,15.8,1,0


In [115]:
temp_data_clean.to_csv("resources/new_temperature_df.csv", index=False)

Date-derived columns such as `month` and `day_of_week` are useful features for time series forecasting.
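
For datetime columns, Pandas also offers the vectorized `.dt` accessor, which avoids looping over each value. The sketch below uses a small hypothetical sample (the third row is made up for illustration) and then summarizes by month, the kind of step a forecasting workflow often starts with:

```python
import pandas as pd

# Hypothetical sample; the column name "Temp" is an assumption
sample = pd.DataFrame({
    "Date": pd.to_datetime(["1981-01-01", "1981-01-02", "1981-02-01"]),
    "Temp": [20.7, 17.9, 18.0],
})

# .dt exposes datetime attributes for the whole column at once
sample["month"] = sample["Date"].dt.month
sample["day_of_week"] = sample["Date"].dt.dayofweek  # Monday=0 ... Sunday=6

# Average temperature per calendar month
monthly_mean = sample.groupby("month")["Temp"].mean()
print(monthly_mean)
```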

### Exercise

Now that you've gone through some common data cleaning scenarios when dealing with CSV files, let's "wrap this in a function".

Write a function that takes in the `melbourne_temperature.csv` file, conducts the data cleaning we did here, including creating the `month` and `day_of_week` datetime columns above, and outputs the result to a CSV file called `clean_melbourne_temperature.csv`. Call your function something appropriate.

Start by creating a wrapper function and checking that it works:

In [116]:
def clean_data_file(location):
    print('Testing function')
    return

In [117]:
clean_data_file('resources/melbourne_temperature.csv')

Testing function


Now create the function fully. To illustrate what Python offers, we will go beyond a simple wrapper and add other features: (i) use both a path and a filename; (ii) create a docstring (documentation) for the function.

File paths are written differently on Windows and macOS (backslashes versus forward slashes as separators). To handle file paths in a system-agnostic manner, use `pathlib` from the Python standard library.

In [118]:
from pandas import read_csv

In [120]:
from pathlib import Path

In [121]:
Path('resources')

PosixPath('resources')

In [122]:
Path('resources')/ 'melbourne_temperature.csv'

PosixPath('resources/melbourne_temperature.csv')

For more on file paths in Python 3:

https://medium.com/@ageitgey/python-3-quick-tip-the-easy-way-to-deal-with-file-paths-on-windows-mac-and-linux-11a072b58d5f
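
A few more `pathlib` features are handy for the exercise. The sketch below is illustrative only: the variable names `source` and `target` are assumptions. It shows how to pull apart a path and how to derive an output file name next to the input file. The "pure" path classes render the same join for each operating system without touching the disk:

```python
from pathlib import Path, PurePosixPath, PureWindowsPath

source = Path("resources") / "melbourne_temperature.csv"

print(source.name)    # file name with extension
print(source.stem)    # file name without extension
print(source.suffix)  # extension, including the dot

# with_name swaps the file name while keeping the folder
target = source.with_name("clean_" + source.name)
print(target)

# The same join rendered in macOS/Linux style and in Windows style
print(PurePosixPath("resources") / "melbourne_temperature.csv")
print(PureWindowsPath("resources") / "melbourne_temperature.csv")
```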

In [123]:
# YOUR CODE HERE
def temp_clean(filepath, filename):
    
    '''Loads data from melbourne_temperature.csv and 
    cleans it to prepare for analysis
    
    Retrieves data from CSV file (filename) 
    located in filepath.
    
    Args: 
        filepath: the pathname for the CSV file.
        filename: the filename for the CSV file.
        
    Returns:
        The cleaned DataFrame. The same data is also written to
        filepath as a CSV file named 'clean_' + filename.
    '''
    
    # Convert filepath and filename into format 
    # that will work on both macOS and Windows
    # Also, add prefix for new file name for simplicity
    # (don't have to worry about affecting .csv extension)
    
    data_folder = Path(filepath)
    file_to_open = data_folder / filename
    output_file = 'clean_' + filename # for output file
    file_to_write = data_folder / output_file
    
    # load data
    temp_data = pd.read_csv(file_to_open, parse_dates=[0])
    temp_data = temp_data.rename(
        columns={"Daily minimum temperatures in Melbourne, Australia, 1981-1990":
                 "Min_Melbourne_Temp"})
    
    # remove rows with ? 
    temp_data['has_question_mark'] = temp_data['Min_Melbourne_Temp'].map(lambda x: '?' in str(x))
    # .copy() avoids a warning when new columns are added to the filtered data
    temp_data = temp_data[~temp_data['has_question_mark']].copy()
    temp_data['Min_Melbourne_Temp_new'] = temp_data['Min_Melbourne_Temp'].astype(float)
    
    
    # remove unnecessary columns
    temp_data = temp_data.drop(['Min_Melbourne_Temp', 'has_question_mark'], axis = 1)
    
    # parse date
    temp_data['month'] = [x.month for x in temp_data['Date']]
    temp_data['day_of_week'] = [x.dayofweek for x in temp_data['Date']]
    
    # write data to new file
    temp_data.to_csv(file_to_write, index=False)
    
    print('Clean data was written to {}'.format(file_to_write))
    
    return temp_data

In [124]:
df_cleansed = temp_clean('resources', 'melbourne_temperature.csv')

Clean data was written to resources/clean_melbourne_temperature.csv


In [125]:
df_cleansed

Unnamed: 0,Date,Min_Melbourne_Temp_new,month,day_of_week
0,1981-01-01,20.7,1,3
1,1981-01-02,17.9,1,4
2,1981-01-03,18.8,1,5
3,1981-01-04,14.6,1,6
4,1981-01-05,15.8,1,0
...,...,...,...,...
3645,1990-12-27,14.0,12,3
3646,1990-12-28,13.6,12,4
3647,1990-12-29,13.5,12,5
3648,1990-12-30,15.7,12,6


### Reading from Excel

We'll use another dataset for a quick illustration of reading from Excel.

Here's an energy usage dataset from the U.S. Energy Information Administration as an example. We've downloaded the "Energy consumption estimates by sector, 1949–2012" dataset to the "energy" data folder.

https://www.eia.gov/totalenergy/data/annual/#consumption

In [126]:
# .read_csv() takes the path to a csv file and returns a DataFrame
df_csv = pd.read_csv('resources/energy/MER_T02_01.csv')
df_csv.head()

Unnamed: 0,MSN,YYYYMM,Value,Column_Order,Description,Unit
0,TXRCBUS,194913,4460.434,1,Primary Energy Consumed by the Residential Sector,Trillion Btu
1,TXRCBUS,195013,4829.337,1,Primary Energy Consumed by the Residential Sector,Trillion Btu
2,TXRCBUS,195113,5104.476,1,Primary Energy Consumed by the Residential Sector,Trillion Btu
3,TXRCBUS,195213,5158.193,1,Primary Energy Consumed by the Residential Sector,Trillion Btu
4,TXRCBUS,195313,5052.515,1,Primary Energy Consumed by the Residential Sector,Trillion Btu


In [127]:
# EnergyData.xlsx is a spreadsheet containing the above energy data
# There are two sheets (before2000, after2000) 
# before2000: has the data from before the year 2000
# after2000: has the data from after the year 2000
# By default read_excel only returns the first sheet
# It also seems that the first column is meant to be an index column
# By default all columns will be read in and those without headers will be labeled 'Unnamed: x'
df_excel = pd.read_excel('resources/energy/EnergyData.xlsx')
df_excel.head()

Unnamed: 0.1,Unnamed: 0,MSN,YYYYMM,Value,Column_Order,Description,Unit
0,0,TXRCBUS,194913,4460.434,1,Primary Energy Consumed by the Residential Sector,Trillion Btu
1,1,TXRCBUS,195013,4829.337,1,Primary Energy Consumed by the Residential Sector,Trillion Btu
2,2,TXRCBUS,195113,5104.476,1,Primary Energy Consumed by the Residential Sector,Trillion Btu
3,3,TXRCBUS,195213,5158.193,1,Primary Energy Consumed by the Residential Sector,Trillion Btu
4,4,TXRCBUS,195313,5052.515,1,Primary Energy Consumed by the Residential Sector,Trillion Btu


In [128]:
df_excel.shape

(4125, 7)

In [129]:
df_excel.keys()

Index(['Unnamed: 0', 'MSN', 'YYYYMM', 'Value', 'Column_Order', 'Description',
       'Unit'],
      dtype='object')

In [130]:
# the parameter "sheet_name" allows us to specify which sheet to read in
# the parameter "index_col" allows us to specify a column to use as an index column
df_excel_before2000 = pd.read_excel('resources/energy/EnergyData.xlsx', sheet_name = 'before2000',index_col=0)
df_excel_before2000.head()

Unnamed: 0,MSN,YYYYMM,Value,Column_Order,Description,Unit
0,TXRCBUS,194913,4460.434,1,Primary Energy Consumed by the Residential Sector,Trillion Btu
1,TXRCBUS,195013,4829.337,1,Primary Energy Consumed by the Residential Sector,Trillion Btu
2,TXRCBUS,195113,5104.476,1,Primary Energy Consumed by the Residential Sector,Trillion Btu
3,TXRCBUS,195213,5158.193,1,Primary Energy Consumed by the Residential Sector,Trillion Btu
4,TXRCBUS,195313,5052.515,1,Primary Energy Consumed by the Residential Sector,Trillion Btu


In [131]:
df_excel_before2000.shape

(4125, 6)

In [132]:
# the parameter "sheet_name" allows us to specify which sheet to read in
df_excel_after2000 = pd.read_excel('resources/energy/EnergyData.xlsx', sheet_name = 'after2000')
df_excel_after2000.head()

Unnamed: 0.1,Unnamed: 0,MSN,YYYYMM,Value,Column_Order,Description,Unit
0,375,TXRCBUS,200001,1098.095,1,Primary Energy Consumed by the Residential Sector,Trillion Btu
1,376,TXRCBUS,200002,985.175,1,Primary Energy Consumed by the Residential Sector,Trillion Btu
2,377,TXRCBUS,200003,740.199,1,Primary Energy Consumed by the Residential Sector,Trillion Btu
3,378,TXRCBUS,200004,564.463,1,Primary Energy Consumed by the Residential Sector,Trillion Btu
4,379,TXRCBUS,200005,382.121,1,Primary Energy Consumed by the Residential Sector,Trillion Btu


In [133]:
df_excel_after2000.shape

(2508, 7)

In [134]:
# We can read in multiple sheets at the same time
# A dictionary will be returned where the key is the sheet name and value is the dataframe

# A list of sheet names can be passed to sheet_name; passing None returns all sheets
df_excel_all = pd.read_excel('resources/energy/EnergyData.xlsx', sheet_name = None)
display(df_excel_all['before2000'].head())
display(df_excel_all['after2000'].head())
# print(df_excel_all.keys())

Unnamed: 0.1,Unnamed: 0,MSN,YYYYMM,Value,Column_Order,Description,Unit
0,0,TXRCBUS,194913,4460.434,1,Primary Energy Consumed by the Residential Sector,Trillion Btu
1,1,TXRCBUS,195013,4829.337,1,Primary Energy Consumed by the Residential Sector,Trillion Btu
2,2,TXRCBUS,195113,5104.476,1,Primary Energy Consumed by the Residential Sector,Trillion Btu
3,3,TXRCBUS,195213,5158.193,1,Primary Energy Consumed by the Residential Sector,Trillion Btu
4,4,TXRCBUS,195313,5052.515,1,Primary Energy Consumed by the Residential Sector,Trillion Btu


Unnamed: 0.1,Unnamed: 0,MSN,YYYYMM,Value,Column_Order,Description,Unit
0,375,TXRCBUS,200001,1098.095,1,Primary Energy Consumed by the Residential Sector,Trillion Btu
1,376,TXRCBUS,200002,985.175,1,Primary Energy Consumed by the Residential Sector,Trillion Btu
2,377,TXRCBUS,200003,740.199,1,Primary Energy Consumed by the Residential Sector,Trillion Btu
3,378,TXRCBUS,200004,564.463,1,Primary Energy Consumed by the Residential Sector,Trillion Btu
4,379,TXRCBUS,200005,382.121,1,Primary Energy Consumed by the Residential Sector,Trillion Btu
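
Once the sheets are in a dictionary, a common next step is to stack them back into one DataFrame. The sketch below assumes a dictionary shaped like the one `read_excel` returns with `sheet_name=None`, but builds it by hand from a few of the values shown above so that it runs without the Excel file. The names `sheets` and `combined` are hypothetical:

```python
import pandas as pd

# Stand-in for the dictionary returned by read_excel(..., sheet_name=None)
sheets = {
    "before2000": pd.DataFrame({"YYYYMM": [194913, 195013],
                                "Value": [4460.434, 4829.337]}),
    "after2000": pd.DataFrame({"YYYYMM": [200001, 200002],
                               "Value": [1098.095, 985.175]}),
}

# Passing a dict to concat uses its keys as an outer index level
combined = pd.concat(sheets, names=["sheet", "row"])

# Turn the sheet name into a regular column and renumber the rows
combined = combined.reset_index(level="sheet").reset_index(drop=True)
print(combined)
```

Keeping the sheet name as a column preserves where each row came from, which helps when the sheets overlap or need to be compared later.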


We can write a DataFrame to an Excel file as well:

In [135]:
df_excel_all['after2000'].to_excel('resources/energy/after2000.xlsx', index=False)

### Main Points on File I/O with Pandas
There are many options to read and write data. A few common examples are:
* **read_excel:** read from Excel files
* **to_excel:** write to Excel files
* **read_csv:** read from CSV files
* **to_csv:** write to CSV files

More options can be found [here](https://pandas.pydata.org/pandas-docs/stable/io.html).
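
One detail worth checking when writing and re-reading files is the index. The `Unnamed: 0` columns seen above typically appear when a file was written with its index and then read back without `index_col`. A minimal sketch, assuming a tiny hypothetical DataFrame and an in-memory round trip instead of a file on disk:

```python
import io
import pandas as pd

df = pd.DataFrame({"MSN": ["TXRCBUS", "TXRCBUS"],
                   "Value": [4460.434, 4829.337]})

# Default to_csv writes the index as an unlabeled first column
with_index = pd.read_csv(io.StringIO(df.to_csv()))
print(list(with_index.columns))

# index=False leaves the index out of the file
without_index = pd.read_csv(io.StringIO(df.to_csv(index=False)))
print(list(without_index.columns))

# index_col=0 tells read_csv to use that first column as the index again
restored = pd.read_csv(io.StringIO(df.to_csv()), index_col=0)
print(list(restored.columns))
```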