**Locating missing values**

~~~
ri.isnull()

ri.isnull().sum()
~~~

**Dropping a column**

~~~
ri.drop('county_name',axis='columns',inplace=True)
~~~

**Dropping rows**

~~~
ri.dropna(subset=['stop_date','stop_time'],inplace=True)
~~~
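
A minimal sketch of the three steps above on a toy DataFrame. The name `toy` and its values are invented for illustration and only stand in for `ri`:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for ri; the values are made up
toy = pd.DataFrame({
    'stop_date': ['2005-01-04', None, '2005-02-01'],
    'stop_time': ['12:55', '23:15', None],
    'county_name': [np.nan, np.nan, np.nan],
})

# Count the missing values in each column
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# Drop the all-missing column, then the rows missing a date or a time
cleaned = toy.drop('county_name', axis='columns')
cleaned = cleaned.dropna(subset=['stop_date', 'stop_time'])
print(cleaned.shape)  # (1, 2)
```

Unlike the snippets above, this version reassigns the result instead of using `inplace=True`; both approaches should give the same rows.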

#### Why do data types matter?

- They affect which operations you can perform
- Avoid storing data as strings (when possible)
	- int, float: enables mathematical operations
	- datetime: enables date-based attributes and methods
	- category: uses less memory and runs faster
	- bool: enables logical and mathematical operations

**Fixing a data type**

~~~
apple['price'] = apple.price.astype('float')
~~~

- Dot notation: apple.price
- Bracket notation: apple['price']
	- Must be used on the left side of an assignment statement
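
A small sketch of why bracket notation matters on the left side, at least when a new column is being created. The `toy` DataFrame and the `price_doubled` name are hypothetical:

```python
import warnings
import pandas as pd

# Hypothetical stand-in for apple: prices stored as strings
toy = pd.DataFrame({'price': ['174.35', '175.61']})

# Bracket notation on the left; dot notation is fine on the right
toy['price'] = toy.price.astype('float')
print(toy.price.dtype)  # float64

# Dot-notation assignment to a NEW name only sets a Python attribute.
# pandas warns about this, so the warning is silenced for the demo.
with warnings.catch_warnings():
    warnings.simplefilter('ignore')
    toy.price_doubled = toy.price * 2

print('price_doubled' in toy.columns)  # False
```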

**Using datetime format**

~~~
apple.date.str.replace('/', '-') # .str gives access to string methods; returns a new Series

combined = apple.date.str.cat(apple.time, sep=' ')

apple['date_and_time'] = pd.to_datetime(combined)

apple.set_index('date_and_time', inplace=True)
~~~
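
The same steps on a self-contained toy example. The data and the explicit `format` string are assumptions made for illustration; the notes above let pandas infer the format:

```python
import pandas as pd

# Hypothetical stand-in for apple; date and time stored as strings
toy = pd.DataFrame({
    'date': ['02/13/18', '02/14/18'],
    'time': ['16:00', '16:00'],
    'price': [164.34, 167.37],
})

# Join the two string columns, then parse the result as datetimes
combined = toy.date.str.cat(toy.time, sep=' ')
toy['date_and_time'] = pd.to_datetime(combined, format='%m/%d/%y %H:%M')

# A datetime index enables date-based attributes and resampling
toy = toy.set_index('date_and_time')
print(toy.index.month)
```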

**Counting unique values**

- value_counts(): counts the unique values in a Series
- Best suited for categorical data

~~~
ri.stop_outcome.value_counts()

ri.stop_outcome.value_counts(normalize=True) # proportions
~~~

**Rules for filtering by multiple conditions**

- Ampersand &: only include rows that satisfy both conditions
- Pipe |: include rows that satisfy either condition
- Each condition must be surrounded by parentheses
- Conditions can check for equality (==), inequality (!=), etc.
- Can use more than two conditions.
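
The rules above, sketched on a toy DataFrame. The column names follow `ri`, but the rows are invented:

```python
import pandas as pd

# Hypothetical stand-in for ri
toy = pd.DataFrame({
    'driver_gender': ['F', 'M', 'F', 'M'],
    'is_arrested': [True, False, False, True],
    'violation': ['Speeding', 'Speeding', 'Equipment', 'Other'],
})

# & keeps rows that satisfy both conditions
both = toy[(toy.driver_gender == 'F') & (toy.is_arrested == True)]

# | keeps rows that satisfy either condition
either = toy[(toy.driver_gender == 'F') | (toy.is_arrested == True)]

# More than two conditions can be chained
three = toy[(toy.driver_gender == 'M') &
            (toy.is_arrested == True) &
            (toy.violation != 'Speeding')]

print(len(both), len(either), len(three))  # 1 3 1
```

The parentheses are needed because `&` and `|` bind more tightly than `==` in Python.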

**Correlation, not causation**

- Analyze the relationship between gender and stop outcome
	- Assess whether there is a correlation
- Not going to draw any conclusions about causation
	- Would need additional data and expertise



**Math with Boolean values**

- The mean of a Boolean Series is the proportion of values that are True (True counts as 1, False as 0)

~~~
ri.is_arrested.value_counts(normalize=True)

ri.is_arrested.mean() # works because the dtype is bool
~~~
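
A tiny check of this idea, using an invented Series in place of `ri.is_arrested`:

```python
import pandas as pd

# Hypothetical Boolean Series: one arrest out of four stops
is_arrested = pd.Series([True, False, False, False])

arrest_rate = is_arrested.mean()   # True counts as 1, False as 0
arrest_count = is_arrested.sum()
proportions = is_arrested.value_counts(normalize=True)

print(arrest_rate)   # 0.25
print(arrest_count)  # 1
```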

**Comparing groups using groupby**

- Study the arrest rate by police district

~~~
ri.district.unique()

ri[ri.district == 'Zone K1'].is_arrested.mean()

ri[ri.district == 'Zone K2'].is_arrested.mean()
...

ri.groupby('district').is_arrested.mean()
~~~

**Grouping by multiple categories**

~~~
ri.groupby(['district','driver_gender']).is_arrested.mean()
~~~
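
A sketch showing that `groupby` reproduces the manual per-district filter. The `toy` rows are invented; only the column names come from `ri`:

```python
import pandas as pd

# Hypothetical stand-in for ri
toy = pd.DataFrame({
    'district': ['Zone K1', 'Zone K1', 'Zone K2', 'Zone K2'],
    'driver_gender': ['F', 'M', 'F', 'M'],
    'is_arrested': [False, True, False, False],
})

# Manual approach: one filter per district
manual_k1 = toy[toy.district == 'Zone K1'].is_arrested.mean()

# groupby computes the same rate for every district at once
by_district = toy.groupby('district').is_arrested.mean()

# Grouping by two columns returns a Series with a MultiIndex
by_both = toy.groupby(['district', 'driver_gender']).is_arrested.mean()

print(by_district)
print(by_both)
```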

**Examining the search types**

~~~
ri.search_type.value_counts(dropna=False) # to show missing values
~~~

- Multiple values are separated by commas
	- Locate 'Inventory' among multiple search types

**Searching for a string**

~~~
ri['inventory'] = ri.search_type.str.contains('Inventory', na=False) # returns False for missing values

print(ri.inventory.dtype) # bool
~~~

**Calculating the inventory rate**

~~~
ri.inventory.mean()
~~~

- 0.5% of all traffic stops resulted in an inventory (including those in which a search was not conducted)

~~~
searched = ri[ri.search_conducted == True]

searched.inventory.mean()
~~~

- 13.3% of searches included an inventory
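
The two rates differ only in their denominator. A sketch with invented rows (so the percentages will not match the ones above):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for ri; search_type is missing when no search happened
toy = pd.DataFrame({
    'search_conducted': [False, True, True, False],
    'search_type': [np.nan, 'Incident to Arrest,Inventory',
                    'Probable Cause', np.nan],
})

# na=False turns missing values into False instead of NaN
toy['inventory'] = toy.search_type.str.contains('Inventory', na=False)

# Denominator: all stops
rate_all_stops = toy.inventory.mean()

# Denominator: only stops where a search was conducted
searched = toy[toy.search_conducted]
rate_searches = searched.inventory.mean()

print(rate_all_stops, rate_searches)  # 0.25 0.5
```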

**Accessing datetime attributes**

~~~
print(apple.date_and_time.dt.month) # prints month per row

apple.set_index('date_and_time', inplace=True)
print(apple.index.month) # without dt accessor
~~~

**Calculating the monthly mean price**

- *apple.groupby('month').price.mean()* is invalid because 'month' is not a column of the DataFrame

~~~
monthly_price = apple.groupby(apple.index.month).price.mean()
~~~

**Plotting the monthly mean price**

~~~
import matplotlib.pyplot as plt

monthly_price.plot()

plt.xlabel('Month')
plt.ylabel('Price')
plt.title('Monthly mean stock price for Apple')

plt.show()
~~~

**Resampling the price**

~~~
apple.price.resample('M').mean()
# note: pandas 2.2+ deprecates the 'M' alias in favor of 'ME' (month end)

# similar to
# apple.groupby(apple.index.month).price.mean()
# but the index consists of the last day of each month
# rather than the month number
~~~
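
One more difference shows up once the data spans several years: grouping by `index.month` pools the same month across years, while resampling keeps each calendar month separate. A sketch with an invented price Series; `pd.offsets.MonthEnd()` is used here in place of the string alias only so the example does not depend on the pandas version:

```python
import pandas as pd

# Hypothetical prices spanning two different Januaries
idx = pd.to_datetime(['2017-01-10', '2017-01-20', '2017-02-10', '2018-01-10'])
price = pd.Series([100.0, 110.0, 120.0, 200.0], index=idx)

# Month number as the key: January 2017 and January 2018 are pooled
by_month_number = price.groupby(price.index.month).mean()

# Month-end bins: one row per calendar month, NaN for months without data
by_calendar_month = price.resample(pd.offsets.MonthEnd()).mean()

print(by_month_number)
print(by_calendar_month)
```

Here `by_month_number` has two rows (months 1 and 2), while `by_calendar_month` has thirteen, from 2017-01-31 through 2018-01-31.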

**Resampling the volume**

~~~
apple.volume.resample('M').mean()
~~~

**Concatenating price and volume**

~~~
monthly_price = apple.price.resample('M').mean()

monthly_volume = apple.volume.resample('M').mean()

monthly = pd.concat([monthly_price, monthly_volume], axis='columns')
~~~

**Plotting price and volume**

~~~
monthly.plot(subplots=True) # separate plots

plt.show()
~~~

**Computing a frequency table**

~~~
table = pd.crosstab(ri.driver_race, ri.driver_gender)

table = table.loc['Asian':'Hispanic']
~~~

- Frequency table: tally of how many times each combination of values occurs
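
A sketch of a frequency table on invented rows, to show what the tally and the label-based slice look like:

```python
import pandas as pd

# Hypothetical stand-in for ri
toy = pd.DataFrame({
    'driver_race': ['Asian', 'Black', 'Black', 'Hispanic', 'White', 'White'],
    'driver_gender': ['F', 'M', 'M', 'F', 'F', 'M'],
})

# Rows: race, columns: gender, cells: number of occurrences
table = pd.crosstab(toy.driver_race, toy.driver_gender)

# Label-based slicing with .loc includes both endpoints
subset = table.loc['Asian':'Hispanic']

print(table)
print(subset)
```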

**Creating a bar plot**

- Much more suitable for showing categorical data (than a line plot)

~~~
table.plot(kind='bar')
plt.show()
~~~

**Stacking the bars**

~~~
table.plot(kind='bar',stacked=True)
plt.show()
~~~

**Analyzing an object column**

- Create a Boolean column: True if the price went up, and False otherwise
- Calculate how often the price went up by taking the column mean
- astype() can't be used in this case (pandas can't infer how to cast the object column)

**Mapping one set of values to another**

~~~
mapping = {'up': True, 'down': False}

apple['is_up'] = apple.change.map(mapping)

print(apple.is_up.mean())
~~~
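
A self-contained version of this mapping, with an invented `change` column standing in for the one in `apple`:

```python
import pandas as pd

# Hypothetical stand-in for apple: 'change' is an object column
toy = pd.DataFrame({'change': ['up', 'down', 'up', 'up']})

# Keys are the existing values, values are what they should become
mapping = {'up': True, 'down': False}
toy['is_up'] = toy.change.map(mapping)

share_up = toy.is_up.mean()
print(toy.is_up.dtype)  # bool
print(share_up)         # 0.75
```

Values that are missing from the dictionary would become NaN, so it helps to cover every value in the column.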

**Calculating the search rate**

- Visualize how often searches were performed after each type of violation

~~~
search_rate = ri.groupby('violation').search_conducted.mean()

search_rate.plot(kind='bar')
plt.show()
~~~

**Ordering the bars**

- Order the bars from left to right by size

~~~
search_rate.sort_values().plot(kind='bar') # sort_values() returns a new Series, so chain it
plt.show()
~~~

**Rotating the bars**

~~~
search_rate.plot(kind='barh')
plt.show()
~~~

**Changing data type from object to category**

~~~
ri.stop_length.unique()
# dtype: object
~~~

- Category type stores the data more efficiently
- Allows you to specify a logical order for the categories

~~~
ri.stop_length.memory_usage(deep=True)
# over 8MB

cats = ['short','medium','long'] # in order!

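cat_type = pd.CategoricalDtype(categories=cats, ordered=True)
ri['stop_length'] = ri.stop_length.astype(cat_type)
# astype('category', ordered=True, categories=cats) only worked in old pandas versions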

ri.stop_length.memory_usage(deep=True)
# around 3.4MB
~~~

**Using ordered categories**

~~~
print(ri[ri.stop_length > 'short'].shape)

print(ri.groupby('stop_length').is_arrested.mean())
# orders logically
~~~
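
A sketch of ordered categories on invented rows. It uses `pd.CategoricalDtype`, which is how current pandas versions expect the order to be specified, and passes `observed=False` explicitly to `groupby` to avoid a version-dependent default:

```python
import pandas as pd

# Hypothetical stand-in for ri
toy = pd.DataFrame({
    'stop_length': ['medium', 'short', 'long', 'short'],
    'is_arrested': [False, False, True, False],
})

cats = ['short', 'medium', 'long']  # logical order, not alphabetical
cat_type = pd.CategoricalDtype(categories=cats, ordered=True)
toy['stop_length'] = toy.stop_length.astype(cat_type)

# Comparison operators respect the category order
longer_than_short = toy[toy.stop_length > 'short']
print(longer_than_short.shape)  # (2, 2)

# groupby output follows the category order as well
rates = toy.groupby('stop_length', observed=False).is_arrested.mean()
print(rates)
```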

#### Merging DataFrames

~~~
apple.reset_index(inplace=True)

high = high_low[['DATE','HIGH']]

apple_high = pd.merge(left=apple, right=high, left_on='date', right_on='DATE', how='left')
# joining right onto left df

apple_high.set_index('date_and_time',inplace=True)
# setting the index
~~~
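
A sketch of what a left merge does when a key is missing from the right DataFrame. Both DataFrames are invented; only the column names follow the snippet above:

```python
import pandas as pd

# Hypothetical stand-ins for apple and high
apple_toy = pd.DataFrame({
    'date': ['2018-02-14', '2018-02-14', '2018-02-15'],
    'price': [167.37, 166.60, 172.99],
})
high = pd.DataFrame({
    'DATE': ['2018-02-14', '2018-02-16'],
    'HIGH': [167.54, 174.82],
})

# how='left' keeps every row of the left DataFrame
merged = pd.merge(left=apple_toy, right=high,
                  left_on='date', right_on='DATE', how='left')

print(merged)
```

Every row of `apple_toy` survives, the 2018-02-15 row gets NaN in `DATE` and `HIGH`, and the 2018-02-16 row of `high` is dropped because it has no match on the left.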

**Examining a multi-indexed Series**

~~~
search_rate = ri.groupby(['violation','driver_gender']).search_conducted.mean()

search_rate.loc['Equipment'] # returns search rate by gender

search_rate.loc['Equipment','M'] # returns a value

search_rate.unstack() # converts into a DataFrame

ri.pivot_table(index='violation', columns='driver_gender', values='search_conducted', aggfunc='mean')
# creates a DataFrame equivalent to the previous .unstack()
~~~
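
A sketch checking on invented rows that `unstack()` and `pivot_table()` give the same values:

```python
import pandas as pd

# Hypothetical stand-in for ri
toy = pd.DataFrame({
    'violation': ['Equipment', 'Equipment', 'Speeding', 'Speeding', 'Speeding'],
    'driver_gender': ['F', 'M', 'F', 'M', 'M'],
    'search_conducted': [False, True, False, False, True],
})

search_rate = toy.groupby(['violation', 'driver_gender']).search_conducted.mean()

by_gender = search_rate.loc['Equipment']      # Series indexed by gender
single = search_rate.loc['Equipment', 'M']    # a single value

wide = search_rate.unstack()                  # inner index level becomes columns
pivot = toy.pivot_table(index='violation', columns='driver_gender',
                        values='search_conducted', aggfunc='mean')

print(wide)
print(pivot)
```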