# Subsetting and filtering data
If you want to type along with me, use [this notebook](https://humboldt.cloudbank.2i2c.cloud/hub/user-redirect/git-pull?repo=https%3A%2F%2Fgithub.com%2Fbethanyj0%2Fdata271_sp24&branch=main&urlpath=tree%2Fdata271_sp24%2Fdemos%2Fdata271_demo14_live.ipynb) instead. 
If you don't want to type and want to follow along just by executing the cells, stay in this notebook. 

In [None]:
import numpy as np
import pandas as pd

## Import data

In [None]:
# read a csv in your working directory
df = pd.read_csv('earthquakes.csv')
df.head(2)

### Filtering with conditions

In [None]:
# keep only the rows where this boolean statement is true (mag greater than or equal to 7)
df[df.mag >= 7]

In [None]:
# note that the other notation works here too
df[df['mag'] >= 7]

In [None]:
# selected columns for earthquakes that had magnitude greater than or equal to 7 OR caused a tsunami
df.loc[
    (df.tsunami == 1) | (df.mag >= 7),
    ['mag', 'title', 'tsunami', 'place']
]
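
As a small self-contained sketch of combining conditions, the cell below uses a made-up `quakes` frame (hypothetical values, not rows from `earthquakes.csv`). It contrasts `|` (OR) with `&` (AND); each condition needs its own parentheses because these operators bind more tightly than comparisons.

```python
import pandas as pd

# Hypothetical toy data standing in for earthquakes.csv
quakes = pd.DataFrame({
    'mag': [7.2, 5.1, 6.8, 4.0],
    'tsunami': [0, 1, 0, 0],
    'place': ['Alaska', 'Indonesia', 'Japan', 'Nevada'],
})

# OR: keep a row if at least one condition is true
either = quakes.loc[(quakes.tsunami == 1) | (quakes.mag >= 7), ['mag', 'place']]

# AND: keep a row only if both conditions are true
both = quakes.loc[(quakes.tsunami == 1) & (quakes.mag >= 7), ['mag', 'place']]

print(either)
print(both)
```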

In [None]:
# Checking if strings in a column (Series) contain another string


In [None]:
# Just get the earthquakes in California
df.loc[
    (df.place.str.contains('California')),
    ['mag', 'title', 'tsunami', 'place']
]

In [None]:
# We might have missed some: the USGS tags some locations as California and others as CA. Use a regex!
cali_df = 
cali_df
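
A minimal sketch of why the regex matters, using three made-up place strings (assumptions, not taken from the data). `str.contains` treats its pattern as a regular expression by default, so `'CA|California'` matches either spelling.

```python
import pandas as pd

# Hypothetical place strings in the USGS style
places = pd.Series([
    '10km NE of Aguanga, CA',
    '5km S of Ridgecrest, California',
    '12km W of Anchorage, Alaska',
])

# The plain substring misses the row tagged "CA"
plain = places.str.contains('California')

# The regex alternation matches "CA" or "California"
either = places.str.contains('CA|California')

print(plain.tolist())
print(either.tolist())
```

Note that `'CA'` matches those two capital letters anywhere in the string, so on real data it is worth checking the matched rows by eye.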

In [None]:
# if we just want the columns related to magnitude
df.loc[
    (df.place.str.contains('CA|California')),
    [...]
]

In [None]:
# another way
df.loc[
    (df.place.str.contains('CA|California')),
    ...
]
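
One possible way to fill in the two cells above, sketched on a made-up frame (the column names mirror the ones used in this notebook, the values are hypothetical): list the columns by name, or build a boolean mask over the column labels.

```python
import pandas as pd

# Hypothetical toy data with two magnitude-related columns
quakes = pd.DataFrame({
    'mag': [7.2, 5.1],
    'magType': ['mww', 'mb'],
    'place': ['Alaska', 'Nevada'],
})

# Option 1: name the columns explicitly
by_name = quakes.loc[:, ['mag', 'magType']]

# Option 2: boolean mask over the column labels
by_mask = quakes.loc[:, quakes.columns.str.contains('mag')]

print(by_mask)
```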

In [None]:
# get all the earthquakes with magnitude between 6.45 and 7.5 (inclusive)
df.loc[...,['mag','magType','title','tsunami','type']]

In [None]:
# another way (a little messier)
df.loc[...,['mag','magType','title','tsunami','type']]
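
A hedged sketch of two ways to express an inclusive range, on made-up magnitudes. `Series.between` includes both endpoints by default; the second form spells out the same test with two comparisons joined by `&`.

```python
import pandas as pd

# Hypothetical magnitudes, including both boundary values
quakes = pd.DataFrame({'mag': [6.45, 7.5, 7.6, 5.0]})

# between() is inclusive on both ends by default
with_between = quakes.loc[quakes.mag.between(6.45, 7.5)]

# Equivalent, a little messier
with_compare = quakes.loc[(quakes.mag >= 6.45) & (quakes.mag <= 7.5)]

print(with_between)
```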

In [None]:
# accessing things that match anything in a list
df.loc[...,['mag','magType','title','tsunami','type']]
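
For matching against a list of values, `Series.isin` is one option. The sketch below uses a made-up frame and a hypothetical list `wanted`; the notebook's own list may differ.

```python
import pandas as pd

# Hypothetical toy data
quakes = pd.DataFrame({
    'mag': [7.2, 5.1, 6.8, 4.0],
    'magType': ['mw', 'mb', 'mwb', 'ml'],
})

# Keep rows whose magType equals any entry in the list
wanted = ['mw', 'mwb']
matches = quakes.loc[quakes.magType.isin(wanted), ['mag', 'magType']]

print(matches)
```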

### Finding and selecting the minimum and maximum
We might want to know the lowest- and highest-magnitude earthquakes that occurred in California during the time frame the data frame covers, along with where and when they occurred. Pandas lets us find the index of these extrema, which we can then use to select the entire row.

In [None]:
# get the index of lowest and highest magnitude earthquakes in California
cali_df.mag..., cali_df.mag...

In [None]:
# ERROR! argmin/argmax return integer POSITIONS, not index labels, so .loc selects the wrong rows (or raises a KeyError)
cali_df.loc[
    [cali_df.mag.argmin(), cali_df.mag.argmax()],
    ['mag', 'title', 'tsunami', 'place']
]

In [None]:
# get the index LABEL of the lowest and highest magnitude earthquakes in Cali
cali_df.mag..., cali_df.mag...

In [None]:
# This allows us to index with loc
cali_df.loc[
    [cali_df.mag.idxmin(), cali_df.mag.idxmax()],
    ['mag', 'title', 'tsunami', 'place']
]

The largest quake in California was in Trinidad! 
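
To make the position-versus-label distinction concrete, here is a small sketch on a made-up frame whose index labels (10, 42, 77) are deliberately different from the row positions (0, 1, 2), as happens after filtering.

```python
import pandas as pd

# Hypothetical filtered subset: labels no longer match positions
subset = pd.DataFrame({'mag': [3.1, 5.6, 1.2]}, index=[10, 42, 77])

# argmin/argmax give integer positions
pos_min, pos_max = subset.mag.argmin(), subset.mag.argmax()

# idxmin/idxmax give index labels
label_min, label_max = subset.mag.idxmin(), subset.mag.idxmax()

# Labels pair with .loc, positions pair with .iloc
by_label = subset.loc[[label_min, label_max]]
by_position = subset.iloc[[pos_min, pos_max]]

print(pos_min, pos_max, label_min, label_max)
print(by_label)
```

Here `subset.loc[[pos_min, pos_max]]` would raise a `KeyError`, because 2 and 1 are not labels in this index.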

## Plotting with Pandas

In [None]:
# histograms
df.plot(kind='hist',y='mag');

In [None]:
# line plots
df.plot(kind='line',x = 'time', y='mag');

In [None]:
# scatter plots
df.plot(kind='scatter',x='gap',y='mag');

In [None]:
# bar charts
df.value_counts('status').plot(kind='bar');

In [None]:
# another notation 
df.plot.hist(y='mag');

### Adding/removing data

*NOTE:* Some pandas methods update the original dataframe. If you want to avoid updating the original, do the following. 

In [None]:
# make a copy that will not modify the original
df_copy = df.copy()
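
A brief sketch of the difference between a second name for the same dataframe and a true copy, on a made-up frame. The names `alias` and `independent` are illustrative only.

```python
import pandas as pd

original = pd.DataFrame({'mag': [1.0, 2.0]})

alias = original               # just another name for the same object
independent = original.copy()  # a separate dataframe

alias['flag'] = True           # also shows up in `original`
independent['other'] = 0       # does not touch `original`

print(original.columns.tolist())
```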

### Columns

In [None]:
# We can filter columns immediately when we import the data
df_sub = pd.read_csv('earthquakes.csv',...)
df_sub.head(3)
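
One way to filter columns at import time is the `usecols` parameter of `pd.read_csv`; whether that is the argument intended above is an assumption. The sketch reads a tiny made-up CSV from a string so it runs without `earthquakes.csv`.

```python
import io
import pandas as pd

# Hypothetical CSV contents, standing in for a file on disk
csv_text = "mag,place,time\n7.2,Alaska,1\n5.1,Nevada,2\n"

# Only the listed columns are loaded
subset = pd.read_csv(io.StringIO(csv_text), usecols=['mag', 'place'])

print(subset)
```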

In [None]:
# Pandas broadcasts. This is useful for adding new columns

df_sub.head()

In [None]:
# More broadcasting

df_sub.head()
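
A minimal sketch of broadcasting when adding columns, on a made-up frame. The column names `source` and `mag_negative` are illustrative assumptions: a single scalar is repeated for every row, and a comparison is applied elementwise to a whole column.

```python
import pandas as pd

# Hypothetical toy data
quakes = pd.DataFrame({'mag': [7.2, -0.3, 5.1]})

# A scalar is broadcast to every row
quakes['source'] = 'USGS'

# A comparison against a scalar is evaluated row by row
quakes['mag_negative'] = quakes.mag < 0

print(quakes)
```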

In [None]:
# can be useful for doing additional analysis


In [None]:
# remove columns with .drop


In [None]:
# remove columns with .drop another way


In [None]:
# does not update original
df_sub.head()

In [None]:
# remove columns with .drop and update the original


In [None]:
df_sub.head()
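
The three `.drop` variants described above can be sketched as follows on a made-up frame. By default `.drop` returns a new dataframe and leaves the original alone; `inplace=True` modifies the original and returns `None`.

```python
import pandas as pd

# Hypothetical toy data
quakes = pd.DataFrame({'mag': [7.2], 'place': ['Alaska'], 'time': [1]})

# Two equivalent spellings that return a new dataframe
dropped_a = quakes.drop(columns=['time'])
dropped_b = quakes.drop(['time'], axis=1)

# The original still has the column at this point
still_has_time = 'time' in quakes.columns

# inplace=True updates the original instead
quakes.drop(columns=['time'], inplace=True)

print(quakes.columns.tolist())
```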

### Rows

In [None]:
# prepare two data frames to demonstrate adding and removing rows
tsunami = df_sub[df_sub.tsunami == 1]
no_tsunami = df_sub[df_sub.tsunami == 0]

tsunami.shape, no_tsunami.shape

In [None]:
# concatenate two dataframes


In [None]:
# concatenate two dataframes with unequal columns


In [None]:
# inner join
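
A hedged sketch of the three concatenation cases above, using two tiny made-up frames `a` and `b` (not the `tsunami` and `no_tsunami` frames). With unequal columns, `pd.concat` keeps the union of columns and fills gaps with `NaN`; `join='inner'` keeps only the shared columns.

```python
import pandas as pd

# Hypothetical frames; `b` has one extra column
a = pd.DataFrame({'mag': [7.2], 'place': ['Alaska']})
b = pd.DataFrame({'mag': [5.1], 'place': ['Nevada'], 'tsunami': [0]})

# Same columns: rows are simply stacked
stacked = pd.concat([a, a])

# Unequal columns: union of columns, missing values become NaN
outer = pd.concat([a, b])

# Inner join: only columns present in both frames
inner = pd.concat([a, b], join='inner', ignore_index=True)

print(outer)
print(inner)
```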


In [None]:
# removing rows
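
Rows can be removed by index label with `.drop`. A small sketch on a made-up frame with labels 10, 11, 12:

```python
import pandas as pd

# Hypothetical toy data with non-default index labels
quakes = pd.DataFrame({'mag': [7.2, 5.1, 6.8]}, index=[10, 11, 12])

# Drop rows by label; the original is left unchanged
remaining = quakes.drop([10, 12])

print(remaining)
```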


## Updating data

In [None]:
cali_df['parsed_place']= 
cali_df
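
What `parsed_place` should contain is left open above, so the following is only one possible approach under stated assumptions: strip the leading distance text and normalize a trailing `CA` to `California` with chained regex replacements. The place strings and the patterns are hypothetical.

```python
import pandas as pd

# Hypothetical place strings
quakes = pd.DataFrame({'place': [
    '10km NE of Aguanga, CA',
    '5km S of Ridgecrest, California',
    '12km W of Anchorage, Alaska',
]})

# .copy() so the new column is added to an independent frame
cali = quakes[quakes.place.str.contains('CA|California')].copy()

cali['parsed_place'] = (
    cali.place
    .str.replace(r'^.* of ', '', regex=True)            # drop "10km NE of "
    .str.replace(r', CA$', ', California', regex=True)  # unify the state tag
)

print(cali.parsed_place.tolist())
```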

## Activity 

Consider the following jokes:

1. Q: Why don't scientists trust atoms?
    1. Because they make up everything.
2. Q: What do you call fake spaghetti?
    1. An impasta!
3. Q: Why did the scarecrow win an award?
    1. Because he was outstanding in his field.


Create a Pandas dataframe with the jokes in one column, their answers in another column, and your rating of the joke on a scale of 0-5 stars (ints) in another column. 

Compute your average rating of these jokes.

Access the question and answer of your highest rated joke. (The output should be a Pandas dataframe with one or more rows and two columns.)