
# Pandas for Exploratory Data Analysis

_Author: Kevin Markham (Washington, D.C.)_

---

## Learning Objectives

- Define what Pandas is and how it relates to data science.
- Manipulate Pandas `DataFrames` and `Series`.
- Filter and sort data using Pandas.
- Manipulate `DataFrame` columns.
- Know how to handle null and missing values.

## Lesson Guide

- [What Is Pandas?](#pandas)
- [Reading Files, Selecting Columns, and Summarizing](#reading-files)
    - [Exercise 1](#exercise-one)
- [Filtering and Sorting](#filtering-and-sorting)
    - [Exercise 2](#exercise-two)
- [Renaming, Adding, and Removing Columns](#columns)
- [Handling Missing Values](#missing-values)
    - [Exercise 3](#exercise-three)
- [Split-Apply-Combine](#split-apply-combine)
    - [Exercise 4](#exercise-four)
- [Selecting Multiple Columns and Filtering Rows](#multiple-columns)
- [Joining (Merging) DataFrames](#joining-dataframes)
- [OPTIONAL: Other Commonly Used Features](#other-features)
- [OPTIONAL: Other Less-Used Features of Pandas](#uncommon-features)
- [Summary](#summary)

<a id="pandas"></a>

## What Is Pandas?

- **Objective:** Define what Pandas is and how it relates to data science.

Pandas is a Python library that primarily adds two new datatypes to Python: `DataFrame` and `Series`.

- **Series** == sequence of items, where each item has a unique label (called an index).
- **DataFrame** == a table of data. Each row has a unique label (the row index), and each column has a unique label (the column index).
- Note that each column in a DataFrame is itself a Series, and it shares the DataFrame's row index.
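
To make these definitions concrete, here is a minimal sketch that builds a small `DataFrame` by hand. The variable names, column names, and values are invented for illustration and are not part of the lesson's data.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration.
people = pd.DataFrame(
    {'name': ['Ana', 'Ben', 'Cy'], 'age': [31, 25, 47]},
    index=['a', 'b', 'c'],      # row labels (the row index)
)

ages = people['age']            # selecting one column returns a Series

print(type(people))             # the table as a whole is a DataFrame
print(type(ages))               # a single column is a Series
print(ages.index.tolist())      # the Series carries the same row labels
```

Notice that the row labels travel with the column when it is pulled out of the table.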

Behind the scenes, these datatypes use **NumPy ("Numerical Python")**.

- NumPy primarily adds the ndarray (n-dimensional array) datatype to Python.
- An ndarray is similar to a Python list — it stores ordered data.
- Storing Series and DataFrame data in ndarrays makes Pandas faster and uses less memory than standard Python datatypes. Many libraries (such as scikit-learn) accept ndarrays as input rather than Pandas datatypes, so we will frequently convert between them.
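
The conversion between Pandas objects and ndarrays mentioned above might look like the following sketch. The `scores` table is invented for illustration; `to_numpy()` is the Pandas method assumed here for the conversion.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, invented for illustration.
scores = pd.DataFrame({'math': [90, 80], 'art': [70, 60]})

as_array = scores.to_numpy()            # DataFrame -> 2-D ndarray
one_column = scores['math'].to_numpy()  # Series -> 1-D ndarray

print(type(as_array), as_array.shape)

# ndarray -> DataFrame (the column labels must be supplied again)
back = pd.DataFrame(as_array, columns=scores.columns)
print(back)
```

The ndarray keeps only the values, so the row and column labels have to be reattached when converting back.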

### Using Pandas

Pandas is frequently used in data science because it:

- Offers a large set of commonly used functions.
- Processes data quickly.
- Has a large developer community.
- Shares data easily with other libraries. Because many data science libraries also use NumPy to manipulate data, you can transfer data between them (as we will often do in this class!).

Pandas also takes some getting used to:

- It is a large library.
- It heavily overrides Python operators, resulting in different syntax.
- Looping through a DataFrame row by row is highly discouraged. Instead, Pandas favors vectorized functions that operate column by column.
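
As a small illustration of the point about looping, the sketch below computes the same result twice: once row by row and once with a vectorized expression. The `orders` table is invented for illustration.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration.
orders = pd.DataFrame({'price': [2.0, 3.5, 4.0], 'quantity': [3, 2, 5]})

# Discouraged: build the result one row at a time.
totals_loop = []
for _, row in orders.iterrows():
    totals_loop.append(row['price'] * row['quantity'])

# Preferred: one vectorized expression over whole columns.
orders['total'] = orders['price'] * orders['quantity']

print(orders)
```

Both approaches give the same numbers here, but the vectorized version is shorter and typically much faster on large tables.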

### Class Methods and Attributes

Pandas `DataFrame`s are Pandas class objects and therefore come with attributes and methods. To access these, follow the variable name with a dot. For example, given a `DataFrame` called `users`:

```
users.index       # accesses the `index` attribute; note there are no parentheses because attributes are not callable
users.head()      # calls the `head` method (since there are open/closed parentheses)
users.head(10)    # calls the `head` method with parameter `10`, indicating the first 10 rows; this is the same as:
users.head(n=10)  # calls the `head` method, setting the named parameter `n` to have a value of `10`
```

We know that the `head` method accepts one parameter with an optional name of `n` because it is in the documentation for that method. Let's see how to view the documentation next.

### Viewing Documentation

There are a few ways to find more information about a method.

**Method 1:** In Jupyter, you can quickly view documentation for a method by following the method name by a `?`, as follows:

```
users.head?
```

> ```
> Signature: users.head(n=5)
> Docstring: Returns first n rows
> ```

Notice that we would normally invoke this method by calling `users.head(5)`. One quirk of IPython is that the `?` symbol must be the last character in the cell. Otherwise, it might not work.

> The `?` shows much the same information as the built-in Python function `help`, which prints the method's docstring. For example:
>
> ```
> help(users.head)
> ```

**Method 2:** You can also search online for the phrase "`DataFrame head`", since you are calling the method `head` on the `users` object, which happens to be a `DataFrame`. (`type(users) => pandas.DataFrame`)

You can alternatively search online for `pandas head`, but be careful! `DataFrame` and `Series` both have a `head` method, so make sure you view the documentation for the correct one since they might be called differently. You will know you are looking at the correct documentation page because it will say `DataFrame.head` at the top, instead of `Series.head`.

## Pandas

In [None]:
# Load Pandas into Python
import pandas as pd
from matplotlib import pyplot as plt

%matplotlib inline

In [None]:
# We can see details about the imported package through its module-level "dunder" attributes.
# We want to know which version we are running, because behavior changes between versions.
print(pd.__name__)
print(pd.__version__)

In [None]:
# Quick aside: what is a magic function? Let's find out. Today we're going to spend a bit of time learning to use documentation.
# Try running this:

%magic

#### Magic Functions in Jupyter

IPython (the interactive shell behind Jupyter) has a set of predefined "magic functions" that you can call with a command-line-style syntax. There are two kinds of magics: line-oriented and cell-oriented. Line magics are prefixed with the `%` character and work much like OS command-line calls: they receive the rest of the line as an argument, and arguments are passed without parentheses or quotes. Cell magics are prefixed with `%%`, and they receive not only the rest of the line but also the lines below it as a separate argument.

[Learn More](https://ipython.readthedocs.io/en/stable/interactive/magics.html)

<a id="reading-files"></a>
### Reading Files, Selecting Columns, and Summarizing

Pandas dramatically simplifies the process of reading in data. When we say "reading in data," we mean loading a file into our machine's memory.

When you have a CSV, for example, and then you double-click to open it in Microsoft Excel, the open file is "read into memory." You can now manipulate the CSV.

When we read data into memory in Python, we are creating an object. We will soon explore this object.

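Because we are working with a `.tbl` file (a plain-text table), we could use the [read table](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_table.html) function. Here we use the closely related `read_csv` function and pass the delimiter explicitly.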
<br>
<br>
A [delimiter](https://en.wikipedia.org/wiki/Delimiter-separated_values) is a character that separates fields (columns) in the imported file. Just because a file says `.csv` does not necessarily mean that a comma is used as the delimiter. In this case, we have a pipe character as the delimiter for our columns, so we will be using `sep='|'` to tell pandas to 'cut' the columns every time it sees a pipe character in the file.
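
To see the effect of `sep` without needing a file on disk, here is a minimal sketch that reads pipe-delimited text from an in-memory buffer. The sample rows are invented for illustration and only stand in for a real file.

```python
import io
import pandas as pd

# Stand-in for a pipe-delimited file on disk; the contents are invented.
raw = io.StringIO(
    "user_id|age|occupation\n"
    "1|24|technician\n"
    "2|53|other\n"
)

sample = pd.read_csv(raw, sep='|')   # cut the columns at every pipe character

print(sample.columns.tolist())
print(sample.shape)
```

If `sep='|'` were left out, Pandas would look for commas, find none, and read each line as a single column.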

In [None]:
users = pd.read_csv('./data/user.tbl', sep='|')

**Examine the users data.**

In [None]:
users                   # Display the DataFrame; long DataFrames are truncated to the first and last rows.

In [None]:
type(users)             # DataFrame

In [None]:
users.head()            # Print the first five rows.

In [None]:
users.head(10)          # Print the first 10 rows.

In [None]:
users.tail()            # Print the last five rows.

In [None]:
# The row index (aka "the row labels"; in this case, integers)
users.index


In [None]:
users.head()

In [None]:
# Column names (which is "an index")
users.columns

In [None]:
# Datatypes of each column — each column is stored as an ndarray, which has a datatype
users.dtypes

In [None]:
# Number of rows and columns
users.shape

#### Classroom Challenge

In [None]:
# print the number of rows as an int


#number of columns


In [None]:
# All values as a NumPy array
users.values

In [None]:
# Concise summary (including memory usage) — useful to quickly see if nulls exist
users.info()

**Select or index data.**<br>
Pandas `DataFrame`s have structural similarities with Python-style lists and dictionaries.  
In the example below, we select a column of data using the name of the column in a similar manner to how we select a dictionary value with the dictionary key.

In [None]:
# Select a column — returns a Pandas Series (essentially an ndarray with an index)
users['gender']

In [None]:
# DataFrame columns are Pandas Series.
type(users['gender'])

In [None]:
# Select one column using the DataFrame attribute.
users.gender

# While a useful shorthand, this attribute only exists if the column name
# is a valid Python identifier (no spaces or punctuation) and does not
# clash with an existing DataFrame attribute or method.

**Summarize (describe) the data.**<br>
Pandas has a bunch of built-in methods to quickly summarize your data and provide you with a quick general understanding.

In [None]:
# Describe all numeric columns.
users.describe()

In [None]:
# Describe all object columns (can include multiple types).
users.describe(include=['object'])

In [None]:
# Describe all columns, including non-numeric.
users.describe(include='all')

In [None]:
# Describe a single column — recall that "users.gender" refers to a Series.
users.gender.describe()

In [None]:
# Calculate the mean of the ages.
users.age.mean()

In [None]:
# Draw a histogram of a column (the distribution of ages).
users.age.hist();

**Count the number of occurrences of each value.**

In [None]:
users.occupation.value_counts()     # Most useful for categorical variables

In [None]:
users.occupation.value_counts().plot(kind='barh')     # Quick plot by category

In [None]:
# Can also be used with numeric variables
#   Try .sort_index() to sort by indices or .sort_values() to sort by counts.
users.age.value_counts()

In [None]:
users.age.value_counts().sort_index().plot(kind='bar', figsize=(12,12));     # Bigger plot by increasing age
plt.xlabel('Age');
plt.ylabel('Number of users');
plt.title('Number of users per age');

**Unique values:** Determine the number of distinct values within a given series.

In [None]:
# What are the unique occupations?
users['occupation'].unique()

In [None]:
# HOW MANY distinct occupations are there?
users['occupation'].nunique()

### Quick Summary: Combining What We Have Learned

In [None]:
# Summarize our dataset.
print("Rows     : ", users.shape[0])
print("Columns  : ", users.shape[1])
print("\nFeatures : \n", users.columns.tolist())
print("\nMissing values :  ", users.isnull().sum().sum())
print("\nUnique values :  \n", users.nunique())

print("\nFirst 5 Rows :  \n", users.head())

print("\nLast 5 Rows :  \n", users.tail())

<a id="exercise-one"></a>
### Exercise 1: Drinks Across the World

Because we are working with a CSV, we will use the [read CSV](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html) function.<br>A [delimiter](https://en.wikipedia.org/wiki/Delimiter-separated_values) is a character that separates fields (columns) in the imported file. Just because a file says `.csv` does not necessarily mean that a comma is used as the delimiter. A tab-delimited file, for example, would need `sep='\t'` to tell pandas to 'cut' the columns every time it sees a [tab character in the file](http://vim.wikia.com/wiki/Showing_the_ASCII_value_of_the_current_character). In this case, `drinks.csv` uses commas, which is the default, so no `sep` argument is needed.

In [None]:
# Read drinks.csv into a DataFrame called "drinks".
# Path to csv './data/drinks.csv'
# na_filter=False keeps strings such as "NA" as text instead of treating them as missing.
drinks = pd.read_csv('./data/drinks.csv', na_filter=False)

In [None]:
# Print the head and the tail.


In [None]:
# Examine the default index, datatypes, and shape.



In [None]:
# Print the beer_servings Series.


In [None]:
# Calculate the average beer_servings for the entire data set.


In [None]:
# Count the number of occurrences of each "continent" value and see if it looks correct.


### Selecting Columns

We can select columns in two ways. Either we treat the column as an attribute of the DataFrame or we index the DataFrame for a specific element (in this case, the element is a column name).

In [None]:
# show difference between single (Series) and double (DataFrame) bracket notation
#Dataframe
drinks[['country','wine_servings']]

#Series
#drinks['country']
drinks.country

#### Classroom Challenge

In [None]:
# select the country column, Series object
#drinks.country
drinks['country'].head()

In [None]:
# select the country and wine_servings columns, DataFrame object
drinks[['country', 'wine_servings']].tail()

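**Summary:** Single brackets with one column name return a `Series`. To select more than one column, pass a list of names inside the brackets (double brackets!), which returns a `DataFrame`.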
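
The bracket rule can be checked on a tiny table. This is only a sketch; the `toy` DataFrame and its values are invented for illustration.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration.
toy = pd.DataFrame({'country': ['A', 'B'], 'wine': [10, 20]})

single = toy['country']             # one label -> Series
double = toy[['country']]           # list with one label -> one-column DataFrame
several = toy[['country', 'wine']]  # list with two labels -> two-column DataFrame

print(type(single).__name__, type(double).__name__, type(several).__name__)
```

The inner brackets are just an ordinary Python list of column names.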

<a id="filtering-and-sorting"></a>
### Filtering and Sorting

Filtering and sorting are key processes that allow us to drill into the 'nitty gritty' and cross sections of our dataset.

To filter, we use a process called **Boolean Filtering**, wherein we define a Boolean condition, and use that Boolean condition to filter on our DataFrame.

**Logical filtering: Only show users with age < 20.**

By applying a `boolean mask` to this dataframe, `age < 20`, we can get the following:

In [None]:
users.describe()

In [None]:
# Create a Series of Booleans…
# In Pandas, this comparison is performed element-wise on each row of data.
young_bool = users.age < 20
#young_bool

In [None]:
# …and use that Series to filter rows.
# In Pandas, indexing a DataFrame by a Series of Booleans only selects rows that are True in the Boolean.
users[young_bool]

In [None]:
# Or, combine into a single step.
#users[users.gender == 'F']

users_noother= users[users.occupation != 'other']

In [None]:
users_noother.head()

In [None]:
# Important: filtering with a Boolean mask returns a new object, and Pandas may treat it as a copy of a slice.
# If you store that result in a variable and then assign to it, Pandas may raise a
# SettingWithCopyWarning, and the original DataFrame is not changed.
# Calling .copy() makes it explicit that you want an independent DataFrame and avoids the warning.

# For assignments on the original DataFrame, it is best practice to use .loc and .iloc.

users_under20 = users[users.age < 20].copy()   # Independent copy of the filtered rows
users_under20['newcolumn'] = 0
users_under20
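
A quick way to convince yourself that the filtered copy is independent is to change it and then inspect the original. The sketch below uses an invented `people` table rather than the `users` data.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration.
people = pd.DataFrame({'age': [15, 30, 18]})

teens = people[people.age < 20].copy()   # independent copy of the filtered rows
teens['flag'] = 1                        # add a column to the copy only

print(teens.columns.tolist())    # the copy has the new column
print(people.columns.tolist())   # the original does not
```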

In [None]:
# Select one column from the filtered results.
users[users.age < 20][['occupation']].describe()

In [None]:
# value_counts of resulting Series
users[users.age < 20].occupation.value_counts()

**Logical filtering with multiple conditions**

In [None]:
# Ampersand for `AND` condition. (This is a "bitwise" `AND`.)
# Important: You MUST put parentheses around each expression because `&` has a higher precedence than `<`.
users[(users.age <20) & (users.gender=='F')]

In [None]:
# Pipe for `OR` condition. (This is a "bitwise" `OR`.)
# Important: You MUST put parentheses around each expression because `|` has a higher precedence than `<`.
users[((users.age < 20) | (users.age > 60)) & (users.gender=='M')]

In [None]:
# Preferred alternative to multiple `OR` conditions
#users[((users.occupation =='doctor') | (users.occupation =='lawyer') | (users.occupation =='programmer'))]
user_doclaw = users[users.occupation.isin(['doctor', 'lawyer','programmer'])].copy()

In [None]:
user_doclaw

**Sorting**

In [None]:
# Sort a Series.
users.age.sort_values()

In [None]:
# Sort a DataFrame by a single column.
users.sort_values('age')

In [None]:
# Use descending order instead.
users.sort_values('age', ascending=False)

In [None]:
# Sort by multiple columns.
users.sort_values(['occupation', 'age'], ascending=[True,False])

<a id="exercise-two"></a>
### Exercise 2: Filtered Drinks
Use the `drinks.csv` or `drinks` `DataFrame` from earlier to complete the following.

In [None]:
drinks.head()

In [None]:
# Filter DataFrame to only include European countries.


In [None]:
# Filter DataFrame to only include European countries with wine_servings > 300.


In [None]:
# Calculate the average beer_servings for all of Europe.


In [None]:
# Determine which 10 countries have the highest total_litres_of_pure_alcohol.


<a id="columns"></a>
### Renaming, Adding, and Removing Columns


Perhaps we want to rename our columns. There are a few options for doing this.

In [None]:
# Are beer servings and spirit servings correlated?
drinks.plot(kind='scatter', x='beer_servings', y='spirit_servings');

print(drinks.select_dtypes('number').corr()['beer_servings'])  # Correlation coefficients (numeric columns only)

Renaming **specific** columns by using a dictionary:

In [None]:
# Rename one or more columns using a dictionary mapping; this returns a new DataFrame and leaves the original unchanged.
drinks.rename(columns={'beer_servings':'beer', 'wine_servings':'wine'})


In [None]:
drinks.head()

In [None]:
# Rename one or more columns in the original DataFrame.
drinks.rename(columns={'beer_servings':'beer', 'wine_servings':'wine'}, inplace=True)

In [None]:
drinks.head()

**Replace all column names using a list of matching length.**

Replace during file reading (disables the header from the file).

In [None]:
#List of matching length
drink_cols = ['country', 'beer', 'spirit', 'wine', 'litres','continent'] 

# Read in data with new columns
drinks = pd.read_csv('data/drinks.csv', header=0, names=drink_cols)


In [None]:
drinks.columns

In [None]:
drinks.head()

Replace after file has already been read into Python.

In [None]:
#List of matching length
drink_cols = ['country', 'beer', 'spirit', 'wine', 'liters', 'cont'] 

# Replace after file has already been read into Python.
drinks.columns = drink_cols

In [None]:
drinks.columns

Use list functions to modify your list of names

In [None]:
drink_cols.pop()
drink_cols.append('continent')

drinks.columns = drink_cols

We can use list indexing to mutate the columns we want:

In [None]:
# mutate list
drink_cols[0]='Country'
drink_cols[4]='ltrs'


drinks.columns = drink_cols

In [None]:
drinks.columns

**Easy Column Operations**<br>
Rather than having to reference indexes and create for loops to do column-wise operations, Pandas is smart and knows that when we add columns together we want to add the values in each row together.

In [None]:
# Add a new column as a function of existing columns.
drinks['servings'] = drinks.beer + drinks.spirit + drinks.wine
drinks['mL'] = drinks.ltrs * 1000

drinks.head()

**Removing Columns**

In [None]:
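# axis=0 for rows, 1 for columns
drinks.drop('mL', axis=1)   # Drop the column labeled 'mL' (returns a new DataFrame)
drinks.drop(5, axis=0)      # Drop the row labeled 5 (returns a new DataFrame)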

In [None]:
drinks.head()

In [None]:
# Drop multiple columns.
drinks.drop(['mL', 'servings'], axis=1)

In [None]:
# Drop on the original DataFrame rather than returning a new one.
drinks.drop(['mL', 'servings'], axis=1, inplace=True)

In [None]:
drinks.head()

In [None]:
drinks.drop(5, axis=0, inplace=True)   # Drop the row labeled 5 from the original DataFrame

In [None]:
drinks

<a id="missing-values"></a>
### Handling Missing Values

- **Objective:** Know how to handle null and missing values.

Sometimes, values will be missing from the source data or as a byproduct of manipulations. It is very important to detect missing data. Missing data can:

- Make the entire row ineligible to be training data for a model.
- Hint at data-collection errors.
- Indicate improper conversion or manipulation.
- Actually not be missing — it sometimes means "zero," "false," "not applicable," or "entered an empty string."

For example, a `.csv` file might have a missing value in some data fields:

```
tool_name,material,cost
hammer,wood,8
chainsaw,,
wrench,metal,5
```

When this data is imported, "null" values will be stored in the second row (in the "material" and "cost" columns).

> In Pandas, a "null" value is either `None` or `np.nan` (Not a Number). Many fixed-size numeric datatypes (such as integers) do not have a way of representing `np.nan`. So, numeric columns will be promoted to floating-point datatypes that do support it. For example, when importing the `.csv` file above:

> - **For the second row:** `np.nan` will be stored in both the "material" and "cost" columns. The entire "cost" column (stored as a single `ndarray`) must be stored as floating-point values to accommodate the `np.nan`, even though an integer `8` is in the first row.
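
The type promotion described above can be observed by reading the same three rows from an in-memory buffer. This is a minimal sketch; `io.StringIO` only stands in for a real `.csv` file.

```python
import io
import pandas as pd

csv_text = (
    "tool_name,material,cost\n"
    "hammer,wood,8\n"
    "chainsaw,,\n"
    "wrench,metal,5\n"
)

tools = pd.read_csv(io.StringIO(csv_text))

print(tools)
print(tools.dtypes)           # "cost" is float64 because of the missing value
print(tools.isnull().sum())   # one missing value each in "material" and "cost"
```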

In [None]:
drinks.tail(30)

In [None]:
# Missing values are usually excluded in calculations by default.
drinks.continent.value_counts()              # Excludes missing values in the calculation

In [None]:
# Includes missing values
drinks.continent.value_counts(dropna=False)

In [None]:
# Find missing values in a Series.
# True if missing, False if not missing
drinks.continent.isnull()

In [None]:
# Count the missing values — sum() works because True is 1 and False is 0.
drinks.continent.isnull().sum()

In [None]:
# True if not missing, False if missing
drinks.continent.notnull()

In [None]:
# Only show rows where continent is not missing.
drinks[drinks.continent.notnull()]

**Understanding Pandas Axis**

In [None]:
# Sums "down" the 0 axis (rows), so we get the sum of each numeric column
drinks.sum(axis=0, numeric_only=True)

In [None]:
# axis=0 is the default.
drinks.sum(numeric_only=True)

In [None]:
drinks.head()

In [None]:
# Sums "across" the 1 axis (columns), so we get the sum of the numeric values in each row (beer+spirit+wine+liters+…)
drinks['totalsum'] = drinks.sum(axis=1, numeric_only=True)

In [None]:
drinks.head()

In [None]:
drinks.dtypes
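
The axis convention is easier to remember on a table small enough to add up by hand. The `grid` DataFrame below is invented for illustration.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration.
grid = pd.DataFrame({'a': [1, 2], 'b': [10, 20]})

col_sums = grid.sum(axis=0)   # collapse the rows: one total per column
row_sums = grid.sum(axis=1)   # collapse the columns: one total per row

print(col_sums)
print(row_sums)
```

One way to read it: the axis you name is the one that disappears from the result.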

**Find missing values in a `DataFrame`.**

In [None]:
# DataFrame of Booleans
drinks.isnull()

In [None]:
# Count the missing values in each column — remember by default, axis=0.
print((drinks.isnull().sum()))

drinks.isnull().sum().plot(kind='bar');         # visually
plt.title('Number of null values per column');

**Dropping Missing Values**

In [None]:
# Drop a row if ANY values are missing from any column — can be dangerous!
drinks.dropna(inplace=True)

In [None]:
# Drop a row only if ALL values are missing.
drinks.dropna(how='all')

In [None]:
#double check missing values are gone
drinks[drinks.isnull().any(axis=1)]

**Filling Missing Values**<br>
You may have noticed that the continent North America (NA) does not appear in the `continent` column. Pandas read in the original data and saw "NA", thought it was a missing value, and converted it to a `NaN`, missing value.
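
This behavior can be reproduced on two invented rows. The sketch below reads the same text twice, once with the default settings and once with `na_filter=False`, which switches off missing-value detection entirely.

```python
import io
import pandas as pd

# Invented sample rows; "NA" is meant to stand for North America.
csv_text = "country,continent\nCanada,NA\nFrance,EU\n"

default = pd.read_csv(io.StringIO(csv_text))                   # "NA" is parsed as missing
literal = pd.read_csv(io.StringIO(csv_text), na_filter=False)  # "NA" is kept as text

print(default['continent'].isnull().sum())
print(literal['continent'].tolist())
```

Keep in mind that `na_filter=False` also keeps genuinely empty fields as empty strings, so it is only appropriate when you know how the file encodes missing data.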

In [None]:
#List of matching length
drink_cols = ['country', 'beer', 'spirit', 'wine', 'litres','continent'] 

# Read in data with new columns
drinks = pd.read_csv('data/drinks.csv', header=0, names=drink_cols)

In [None]:
drinks.head(20)

In [None]:
# Fill in missing values with "NA"; this is dangerous to do without manually verifying them!
drinks.continent.fillna(value='NA')

In [None]:
# Store the filled column back in "drinks" (more reliable than inplace=True on a selected column)
drinks['continent'] = drinks.continent.fillna(value='NA')

In [None]:
drinks.head(20)

In [None]:
help(pd.read_csv)

In [None]:
# Turn off the missing value filter — this is a better approach!
drinks = pd.read_csv('./data/drinks.csv', header=0, names=drink_cols, na_filter=False)

In [None]:
drinks.head(20)

<a id="exercise-three"></a>
### Exercise 3: UF-uh oh

In [None]:
# Read ufo.csv into a DataFrame called "ufo".
ufo_data = './data/ufo.csv'
ufo = pd.read_csv(ufo_data)


In [None]:
# Check the shape of the DataFrame.


In [None]:
# What are the three most common colors reported?


In [None]:
# Rename any columns with spaces so that they don't contain spaces.


In [None]:
# Checking your work is a great step


In [None]:
# For reports in VA, what's the most common city?


In [None]:
# Print a DataFrame containing only reports from Arlington, VA.


In [None]:
# Count the number of missing values in each column.


In [None]:
# How many rows remain if you drop all rows with any missing values?


In [None]:
# How many rows did we lose by removing all rows with any missing values?


<a id="split-apply-combine"></a>
### Split-Apply-Combine

Split-apply-combine is a pattern for analyzing data. Suppose we want to find mean beer consumption per continent. Then:

- **Split:** We group data by continent.
- **Apply:** For each group, we apply the `mean()` function to find the average beer consumption.
- **Combine:** We now combine the continent names with the `mean()`s to produce a summary of our findings.
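
The three steps above can be spelled out by hand and compared with the one-line `groupby` version. This is a sketch on invented numbers, not the `drinks` data.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration.
toy = pd.DataFrame({
    'continent': ['EU', 'EU', 'AS', 'AS'],
    'beer': [100, 300, 50, 150],
})

# Split: iterate over one sub-DataFrame per continent.
# Apply: take the mean of "beer" within each group.
means = {}
for continent, group in toy.groupby('continent'):
    means[continent] = group['beer'].mean()

# Combine: collect the per-group results into a single Series.
by_hand = pd.Series(means)

# groupby performs all three steps in one expression.
short = toy.groupby('continent')['beer'].mean()

print(by_hand)
print(short)
```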

In [None]:
# For each continent, calculate the mean beer servings.
drinks.groupby('continent').beer.mean()

In [None]:
# For each continent, calculate the mean of all numeric columns.
drinks.groupby('continent').mean(numeric_only=True)

In [None]:
# For each continent, describe beer servings.
drinks.groupby('continent').beer.describe()

In [None]:
# Similar, but outputs a DataFrame and can be customized — "agg" allows you to aggregate results of Series functions
#drinks.groupby('continent').beer.agg(['count', 'mean', 'min', 'max'])
drinks.groupby('continent').beer.agg(['count','min', 'max']).sort_values('count', ascending=False)

In [None]:
# For each continent, describe all numeric columns.
drinks.groupby('continent').describe()

In [None]:
# For each continent, count the number of rows.
#print((drinks.groupby('continent').continent.count()))
#print((drinks.continent.value_counts()))   # should be the same

<a id="exercise-four"></a>
### Exercise 4: Users

Use the "users" `DataFrame` or "users" file in the Data folder to complete the following.

In [None]:
users.head(1)

In [None]:
# For each occupation in "users", count the number of occurrences.


In [None]:
# For each occupation, calculate the mean age.


In [None]:
# For each occupation, calculate the minimum and maximum ages.


In [None]:
# For each combination of occupation and gender, calculate the mean age.


----

<a id="multiple-columns"></a>
### Selecting Multiple Columns and Filtering Rows

In [None]:
# Select multiple columns — yet another overload of the DataFrame indexing operator!
my_cols = ['City', 'State']     # Create a list of column names...
ufo[my_cols]                    # ...and use that list to select columns.

In [None]:
ufo[['City', 'State']].head(1)

In [None]:
# Or, combine into a single step (this is a Python list inside of the Python index operator!).
ufo[['City', 'State']]

**Use `loc` to select columns by name.**

In [None]:
# "loc" locates the values from the first parameter (colon means "all rows"), and the column "City".
ufo.loc[:, 'City']

In [None]:
# Select two columns.
ufo.loc[:, ['City', 'State']]

In [None]:
ufo.columns

In [None]:
# Select a range of columns. Unlike Python ranges, Pandas index ranges INCLUDE the final column in the range.
ufo.loc[:, 'City':'State']

In [None]:
ufo.head()

In [None]:
# "loc" can also filter rows by "name" (the index).
# Row labeled 1, columns "City" through "State"
ufo.loc[1, 'City':'State']

In [None]:
# Rows labeled 0 through 10 (inclusive), columns "City" through "State"
ufo.loc[0:10, 'City':'State']

In [None]:
# Rows 0/1/2, range of columns
ufo.loc[0:2, 'City':'State'] 

In [None]:
ufo.head(1)

In [None]:
# Use "iloc" to filter rows and select columns by integer position.
# (Rows and columns both have labels; "iloc" ignores the labels and selects by integer position instead.)
# All rows, columns in position 0/3 (City/State)
ufo.iloc[:, [0, 3]]

In [None]:
# All rows, columns in position 0/1/2/3
# Note here it is NOT INCLUDING 4 because this is an integer range, not a Pandas index range!
ufo.iloc[:, 0:4]

In [None]:
# Rows in position 0/1/2, columns in position 2/3
ufo.iloc[0:3, 2:4]
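
The difference between label-based and position-based slicing shows up clearly in the shape of the result. The sketch below uses an invented `toy` table with a default integer row index.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration.
toy = pd.DataFrame({
    'City': ['A', 'B', 'C', 'D'],
    'Shape': ['disk', 'oval', 'disk', 'light'],
    'State': ['VA', 'NY', 'VA', 'CA'],
})

by_label = toy.loc[0:2, 'City':'Shape']   # labels: both endpoints are included
by_position = toy.iloc[0:2, 0:2]          # positions: the end is excluded

print(by_label.shape)      # three rows (labels 0, 1, 2)
print(by_position.shape)   # two rows (positions 0, 1)
```

With the default integer index the two look deceptively similar, which is why the off-by-one difference is easy to miss.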

<a id="joining-dataframes"></a>
### Joining (Merging) `DataFrames`

In [None]:
#import pandas as pd
movie_cols = ['movie_id', 'title']
movies_filename = './data/movies.tbl'

movies = pd.read_table(movies_filename, sep='|', header=None, names=movie_cols, usecols=[0, 1], encoding='latin1')
movies.head()

In [None]:
rating_cols = ['user_id', 'movie_id', 'rating', 'timestamp']
ratings_filename = './data/movie_ratings.tsv'

ratings = pd.read_table(ratings_filename, sep='\t', header=None, names=rating_cols)
ratings.head()

In [None]:
# Merge "movies" and "ratings" (inner join on "movie_id").
movie_ratings = pd.merge(movies, ratings)
movie_ratings.head()

In [None]:
print(movies.shape)
print(ratings.shape)
print(movie_ratings.shape)
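
To see why the merged shape differs from the inputs, here is a small sketch of an inner join on invented tables. It spells out `on` and `how` explicitly; when they are omitted, `pd.merge` joins on the columns the two DataFrames have in common and uses an inner join.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration.
films = pd.DataFrame({'movie_id': [1, 2, 3], 'title': ['Alpha', 'Beta', 'Gamma']})
scores = pd.DataFrame({
    'user_id': [10, 11, 12],
    'movie_id': [1, 1, 2],
    'rating': [5, 3, 4],
})

joined = pd.merge(films, scores, on='movie_id', how='inner')

print(joined)
print(films.shape, scores.shape, joined.shape)
```

Each rating row is matched with its film title, and the film with no ratings ("Gamma") drops out of an inner join.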

<a id="other-features"></a>
### OPTIONAL: Other Commonly Used Features

In [None]:
# Apply an arbitrary function to each value of a Pandas column, storing the result in a new column.
users['under30'] = users.age.apply(lambda age: age < 30)
users.head()

In [None]:
# Apply an arbitrary function to each row of a DataFrame, storing the result in a new column.
#  (Remember that, by default, axis=0. Since we want to go row by row, we set axis=1.)

users['under30male'] = users.apply(lambda row: row.age < 30 and row.gender == 'M', axis=1)

In [None]:
# Map existing values to a different set of values.
users['is_male'] = users.gender.map({'F':0, 'M':1})

In [None]:
users.head()

In [None]:
# Replace all instances of a value in a column (must match entire value).
ufo['State'] = ufo.State.replace('Fl', 'FL')

In [None]:
# String methods are accessed via "str".
ufo.State.str.upper()                               # Converts to upper case
# checks for a substring
ufo['Colors'].str.contains('RED', na=False)         # Missing values count as "no match"

In [None]:
ufo['Time'] = pd.to_datetime(ufo.Time)
ufo.dtypes

In [None]:
# Convert a string to the datetime format (this is often slow — consider doing it in the "read_csv()" method.)
ufo['Time'] = pd.to_datetime(ufo.Time)
ufo.Time.dt.hour                        # Datetime format exposes convenient attributes
(ufo.Time.max() - ufo.Time.min()).days  # Also allows you to do datetime "math"

In [None]:
# Set a column as the index (reset_index() would turn it back into a regular column).
ufo.set_index('Time', inplace=True)


In [None]:
ufo.index

In [None]:
# Change the datatype of a column.
drinks['beer'] = drinks.beer.astype('float')

In [None]:
# Create dummy variables for "continent" and exclude first dummy column.
continent_dummies = pd.get_dummies(drinks.continent, prefix='cont').iloc[:, 1:]
continent_dummies.head()

In [None]:
# Concatenate two DataFrames (axis=0 for rows, axis=1 for columns).
drinks = pd.concat([drinks, continent_dummies], axis=1)

<a id="uncommon-features"></a>
### OPTIONAL: Other Less-Used Features of Pandas

In [None]:
# Detecting duplicate rows
users.duplicated()          # True if a row is identical to a previous row
users.duplicated().sum()    # Count of duplicates
users[users.duplicated()]   # Only show duplicates
users.drop_duplicates()     # Drop duplicate rows
users.age.duplicated()      # Check a single column for duplicates
users.duplicated(['age', 'gender', 'zip_code']).sum()   # Specify columns for finding duplicates

In [None]:
# Convert a range of values into descriptive groups.
drinks['beer_level'] = 'low'    # Initially set all values to "low"
drinks.loc[drinks.beer.between(101, 200), 'beer_level'] = 'med'     # Change 101-200 to "med"
drinks.loc[drinks.beer.between(201, 400), 'beer_level'] = 'high'    # Change 201-400 to "high"
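
As an alternative to assigning each range separately, `pd.cut` can do the binning in one call. This is a different technique from the cell above, shown here only as a sketch on invented values; the bin edges assume no value exceeds 400 (anything above the last edge would become missing rather than "low").

```python
import pandas as pd

# Hypothetical toy values, invented for illustration.
beer = pd.Series([0, 100, 101, 200, 201, 376])

# Bins are open on the left and closed on the right: (-1, 100], (100, 200], (200, 400].
levels = pd.cut(beer, bins=[-1, 100, 200, 400], labels=['low', 'med', 'high'])

print(levels.tolist())
```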

In [None]:
# Display a cross-tabulation of two Series.
pd.crosstab(drinks.continent, drinks.beer_level)

In [None]:
# Convert "beer_level" into the "category" datatype.
drinks['beer_level'] = pd.Categorical(drinks.beer_level, categories=['low', 'med', 'high'])
drinks.sort_values('beer_level')   # Sorts by the categorical ordering (low to high)

In [None]:
# Limit which rows are read when reading in a file — useful for large files!
pd.read_csv('./data/drinks.csv', nrows=10)           # Only read first 10 rows
pd.read_csv('./data/drinks.csv', skiprows=[1, 2])    # Skip the first two rows of data

In [None]:
# Write a DataFrame out to a .csv
drinks.to_csv('drinks_updated.csv')                 # Index is used as first column
drinks.to_csv('drinks_updated.csv', index=False)    # Ignore index

In [None]:
# Create a DataFrame from a dictionary.
pd.DataFrame({'capital':['Montgomery', 'Juneau', 'Phoenix'], 'state':['AL', 'AK', 'AZ']})

In [None]:
# Create a DataFrame from a list of lists.
pd.DataFrame([['Montgomery', 'AL'], ['Juneau', 'AK'], ['Phoenix', 'AZ']], columns=['capital', 'state'])

In [None]:
# Randomly sample a DataFrame.
import numpy as np
mask = np.random.rand(len(drinks)) < 0.66   # Create a NumPy array of Booleans
train = drinks[mask]                        # Will contain around 66% of the rows
test = drinks[~mask]                        # Will contain the remaining rows

In [None]:
# Change the maximum number of rows and columns printed ('None' means unlimited).
pd.set_option('display.max_rows', None)     # Default is 60 rows
pd.set_option('display.max_columns', None)  # Default is 20 columns
print(drinks)

In [None]:
# Reset options to defaults.
pd.reset_option('display.max_rows')
pd.reset_option('display.max_columns')

In [None]:
# Change the options temporarily (settings are restored when you exit the "with" block).
with pd.option_context('display.max_rows', None, 'display.max_columns', None):
    print(drinks)

<a id="summary"></a>
### Summary

Believe it or not, we've only barely touched the surface of everything that Pandas offers. Don't worry if you don't remember most of it — for now, just knowing what exists is key. Remember that the more you use Pandas to manipulate data, the more of these functions you will take interest in, look up, and remember.

In this notebook, the most important things to familiarize yourself with are the basics:
- Manipulating `DataFrames` and `Series`
- Filtering columns and rows
- Handling missing values
- Split-apply-combine (this one takes some practice!)

## Recap

We covered a lot of ground! It's ok if this takes a while to gel.

```python

# basic DataFrame operations
df.head()
df.tail()
df.shape
df.columns
df.index

# selecting columns
df.column_name
df['column_name']

# renaming columns
df.rename(columns={'old_name':'new_name'}, inplace=True)
df.columns = ['new_column_a', 'new_column_b']

# notable columns operations
df.describe() # summary statistics (count, mean, std, min, quartiles, max)
df.column_name.nunique() # number of unique values
df.column_name.value_counts() # number of occurrences of each value in column

# filtering
df[df.column_name < 50] # keep only rows where column_name is less than 50

# sorting
df.sort_values(by='column_name', ascending = False) # sort biggest to smallest

```


It's common to refer back to your own code *all the time.* Don't hesitate to reference this guide! 🐼