# Importing and inspecting data
v.ekc

Here is a basic workflow for getting to know a new dataset:

1. Confirm it loaded correctly

2. Inspect its shape and dimensions

3. Identify the data types, making sure the data types loaded as expected

   - determine the categorical features
   - did your numerical features load as float/integers?

4. Identify missing values

5. Filter for meaningful subsets

6. Summarize patterns

In [None]:
import numpy as np
import pandas as pd

## 1. Import data

You can read in from:
- local file in your working directory
- online file

One cool thing is that when you load Excel files, you can specify which sheet to load with `sheet_name`!

### 1a: Example loading local .csv file

In [None]:
# To use one of the columns as your row index
df = pd.read_csv('earthquakes.csv',index_col='code')
df.head(2)
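
To see what `index_col` changes, here is a minimal sketch that reads the same text twice, once with the default index and once with `index_col='code'`. The three-row CSV text is a made-up stand-in for `earthquakes.csv` (an assumption so the cell runs without the file), and `default_df` and `indexed_df` are illustrative names.

```python
import io
import pandas as pd

# Hypothetical stand-in for earthquakes.csv; the values are invented
csv_text = """code,mag,place
a1,4.5,Somewhere
b2,6.1,Elsewhere
c3,5.0,Nowhere
"""

# Default: pandas builds a 0, 1, 2, ... integer index
default_df = pd.read_csv(io.StringIO(csv_text))

# index_col='code': the 'code' column becomes the row index instead
indexed_df = pd.read_csv(io.StringIO(csv_text), index_col='code')

print(default_df.index.tolist())
print(indexed_df.index.tolist())
print(indexed_df.index.name)
```

Note that once `code` is the index, it no longer appears among the regular columns.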

### 1b: Example: replace default index values with a column name
You can also choose a column to be the row index names with `index_col`!

In [None]:
# read in an excel file
#  set the sheet name
#  set the row index

df = pd.read_excel('https://github.com/bethanyj0/data271_sp24/blob/main/demos/earthquakes.xlsx?raw=True',sheet_name = 'earthquakes',index_col='code')
df.head(2)

In [None]:
df.index.name

### 1c. Example: Reset the index: `df.reset_index(inplace=True/False)`

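With `inplace=True`, the original DataFrame is modified directly and nothing is returned. With `inplace=False` (the default), the original is left unchanged and a new DataFrame is returned.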

In [None]:
#df.reset_index().head(2)

df.reset_index(inplace=True)
df.head(2)
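
The difference between the two forms of `reset_index` is easiest to see on a tiny example. This sketch uses a made-up two-row DataFrame called `toy` (not the earthquake data) with `code` as its index.

```python
import pandas as pd

# Hypothetical toy data indexed by 'code'
toy = pd.DataFrame(
    {'mag': [4.5, 6.1]},
    index=pd.Index(['a1', 'b2'], name='code'),
)

# Without inplace: a new DataFrame comes back and toy is untouched
returned = toy.reset_index()
index_name_after_copy = toy.index.name

# With inplace=True: toy itself changes and the call returns None
result = toy.reset_index(inplace=True)

print(index_name_after_copy)
print(result)
print(toy.columns.tolist())
```

A common mistake is writing `df = df.reset_index(inplace=True)`, which replaces `df` with `None`.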

## 2. Initial Inspection of the Data

Here is your checklist to inspect the data:

| Code | Purpose |
|------|----------|
| `df.empty` | Returns `True` if the DataFrame is empty, `False` otherwise (an attribute, so no parentheses) |
| `df.head()` | Displays the first 5 rows of the DataFrame |
| `df.tail()` | Displays the last 5 rows of the DataFrame |
| `df.shape` | Returns the number of rows and columns `(rows, columns)` |
| `df.info()` | Shows column names, data types, and non-null counts |
| `df.columns` | Lists the column names |
| `df.dtypes` | Shows the data type of each column |
| `df.describe()` | Displays summary statistics for numeric columns |


### 2a. Example: how to use the methods to inspect the data

In [None]:
# Is the data frame empty? Did import fail?
df.empty

In [None]:
# display the top few rows
df.head(3)

In [None]:
# inspecting the last three rows
df.tail(3)

In [None]:
# info() gives more information, including the number of non-nulls
df.info()

### Checkin: are there any columns with missing values?

#### Answer

In [None]:
len(df)

It looks like the dataset has 9,332 observations. Looking at the Non-Null Count column, we see that the following features have fewer than 9,332 non-nulls:
- `alert`
- `cdi`
- `dmin`
- `felt`
- `gap`
- `mag`
- `magType`
- `mmi`
- `nst`
- `tz`
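
Reading the Non-Null Count column by eye works, but the same answer can be computed directly. The sketch below counts missing values per column with `isna().sum()` on a made-up three-row DataFrame called `toy`; the column names mirror the earthquake data but the values are invented.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with a few missing entries
toy = pd.DataFrame({
    'mag': [4.5, np.nan, 5.0],
    'alert': ['green', None, None],
    'place': ['A', 'B', 'C'],
})

# Number of missing values in each column
missing_per_column = toy.isna().sum()

# Keep only the columns that have at least one missing value
columns_with_missing = missing_per_column[missing_per_column > 0].index.tolist()

print(missing_per_column)
print(columns_with_missing)
```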

In [None]:
# summary statistics for the numeric features
df.describe()

## 3. Understanding the variables/features: numeric versus categorical

From step 2, we know which features are numeric and which are categorical after running:
- `df.info()`
- `df.describe()`

Next, we can further explore the numeric and categorical data.

Depending on the data type, we can apply different methods to get a better feeling of the dataset:

| Variable Type | Code | Purpose |
|---------------|------|----------|
| Categorical/Numeric | `df["status"].unique()` | Shows the distinct categories in the column |
| Categorical/Numeric | `df["status"].value_counts()` | Counts how many times each category appears |
| Categorical/Numeric | `df["status"].count()` | Counts non-missing values in the column |
| Numeric | `df["mag"].describe()` | Shows summary statistics (mean, std, min, max, quartiles) |
| Numeric | `df["mag"].mean()` | Computes the average value |
| Numeric | `df["mag"].median()` | Computes the median value |


### 3a. Categorical Features

#### Selecting the categorical features

We can select the categorical values with `df.select_dtypes(object)`

In [None]:
# select all columns with object datatypes
df.select_dtypes(object)

#df.select_dtypes(object).shape

### 3b. Categorical Features
#### Exploring categorical features
We want to find the *levels* of the feature (its unique categories) and how many rows fall in each.

In [None]:
# we can look for unique values in a column
df.status.unique()

In [None]:
# Get the number of rows in each category
df.status.value_counts()

#### plot it for fun

In [None]:
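import matplotlib.pyplot as plt
# take labels and heights from the same value_counts() result so they stay aligned
counts = df.status.value_counts()
plt.bar(counts.index, counts.values);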

### 3c. Numeric Features
#### Grabbing the **numeric** columns
- `df.select_dtypes(int)`
- `df.select_dtypes('number')`

In [None]:
# select all columns with ints
df.select_dtypes(int)

In [None]:
# select all columns with numeric datatypes
df.select_dtypes('number')
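
`select_dtypes` also accepts `include` and `exclude` keywords, which makes it easy to split a DataFrame into numeric and non-numeric parts. A minimal sketch on a made-up DataFrame called `toy` (the column names are borrowed from the earthquake data, the values are invented):

```python
import pandas as pd

# Hypothetical toy data: one integer, one float, one text column
toy = pd.DataFrame({
    'nst': [10, 20],
    'mag': [4.5, 6.1],
    'status': ['reviewed', 'automatic'],
})

# 'number' covers both integer and float columns
numeric_cols = toy.select_dtypes(include='number').columns.tolist()

# exclude='number' keeps everything that is not numeric
non_numeric_cols = toy.select_dtypes(exclude='number').columns.tolist()

print(numeric_cols)
print(non_numeric_cols)
```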

### 3d. Numeric Features
#### Accessing summary statistics from `df.describe()`  

We can get the whole readout with `df.describe()` or we can pull out certain statistics with methods such as
- `df.col_name.mean()`
- `df.col_name.median()` or `df.col_name.quantile(0.5)`
- `df.col_name.quantile(0.75)`
- `df.col_name.sum()`

In [None]:
# if we would like to just describe one column, such as mag (magnitude)
df.mag.describe()

In [None]:
# mean of a column
df.mag.mean()

In [None]:
# median
df.mag.median()

In [None]:
# quantile
df.mag.quantile(0.5)

In [None]:
# sum of a column
df.mag.sum()

In [None]:
# min of a column
df.mag.min()

### 3e. Numeric Features
#### Min/max and argmin/argmax and idxmin/idxmax

- `min/max`: returns the smallest/largest value itself
- `argmin/argmax`: returns the integer position of that value
- `idxmin/idxmax`: returns the index label (row name) of that value

In [None]:
# max of a column
df.mag.max()

In [None]:
# POSITION of maximum (can also use min)
df.mag.argmax()

In [None]:
# INDEX LABEL of maximum (can also use min)
df.mag.idxmax()
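
Position and label only look the same when the index is the default 0, 1, 2, ... range. This sketch uses a made-up Series with string labels so the two are easy to tell apart; `mag` here is a toy Series, not the real column.

```python
import pandas as pd

# Hypothetical magnitudes labelled by earthquake code
mag = pd.Series([4.5, 7.2, 5.0], index=['a1', 'b2', 'c3'])

largest_value = mag.max()        # the value itself
largest_position = mag.argmax()  # integer position, for use with .iloc
largest_label = mag.idxmax()     # index label, for use with .loc

print(largest_value, largest_position, largest_label)

# Both routes lead to the same value
print(mag.iloc[largest_position] == mag.loc[largest_label])
```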

### 3f. Numeric Features
#### Sorting

You can sort the values of a column in two ways:
- `df.col_name.sort_values()`
- `df.sort_values(by=col_name)`

In [None]:
# Sort values in a series
df.mag.sort_values()

In [None]:
# Sort the rows of a DataFrame by the values in one column
df.sort_values(by='mag')
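
By default `sort_values` sorts from smallest to largest and returns a new object. A small sketch of sorting largest-first with `ascending=False`, using a made-up DataFrame called `toy`:

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    'mag': [4.5, 7.2, 5.0],
    'place': ['A', 'B', 'C'],
})

# Largest magnitude first; toy itself is not reordered
largest_first = toy.sort_values(by='mag', ascending=False)

print(largest_first)
print(toy.place.tolist())
```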

In [None]:
# Certain numeric methods may fail on a whole DataFrame that mixes numeric and non-numeric columns
#df.max()

### 3g. Numeric Features
#### Subsetting and `df.groupby()`

In [None]:
# You can do multiple columns at once if all numeric
df.loc[:,['mag','gap']].max()

In [None]:
# Get the average of one column based on another column 
df.groupby('status')['mag'].mean()

In [None]:
# Get the average of multiple columns based on another column 
df.groupby('status')[['mag','gap']].mean()
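
If you want more than one statistic per group, `agg` takes a list of function names. The sketch below is an illustration on a made-up four-row DataFrame called `toy`, not output from the earthquake data.

```python
import pandas as pd

# Hypothetical toy data: two rows per status
toy = pd.DataFrame({
    'status': ['reviewed', 'automatic', 'reviewed', 'automatic'],
    'mag': [5.0, 3.0, 7.0, 4.0],
})

# One statistic per group
mean_by_status = toy.groupby('status')['mag'].mean()

# Several statistics per group, one column per statistic
summary = toy.groupby('status')['mag'].agg(['mean', 'max', 'count'])

print(mean_by_status)
print(summary)
```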

## 4. Filtering DataFrames with boolean indexing

In [None]:
# keep only the rows where this boolean statement is true (mag greater than or equal to 7)
df[df.mag >= 7]

In [None]:
# important columns for earthquakes with magnitude greater than or equal to 7 OR caused a tsunami
df.loc[
    (df.tsunami == 1) | (df.mag >= 7),
    ['mag', 'title', 'tsunami', 'place']
].head(5)
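
Conditions are combined with `|` (or), `&` (and), and `~` (not), and each condition needs its own parentheses. Here is a minimal sketch on a made-up DataFrame called `toy` that shows which rows each combination keeps:

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    'mag': [7.5, 6.0, 5.0, 8.0],
    'tsunami': [0, 1, 0, 1],
})

big = toy.mag >= 7
wave = toy.tsunami == 1

either = toy[big | wave]      # at least one condition is true
both = toy[big & wave]        # both conditions are true
neither = toy[~(big | wave)]  # neither condition is true

print(either.index.tolist())
print(both.index.tolist())
print(neither.index.tolist())
```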

### Filtering DataFrames
#### Boolean mask for a substring: `df.place.str.contains(substring)`

#### Checkin: We want to select the earthquakes that occurred in California.

Step 1: how can we create a boolean mask for the observations in column `place` that contains `California`?

##### Answer

In [None]:
mask = df.place.str.contains('California')
mask

#### Checkin: We want to select the earthquakes that occurred in California.
#### Boolean mask for a substring: `df.place.str.contains(substring)`
Step 2: Apply the mask for Californian earthquakes to the dataset.

##### Answer

In [None]:
df.loc[mask].head(5)

#### Checkin: Now, let's subset the California earthquakes even further--we only want the columns `['mag', 'title', 'tsunami', 'place']`

##### Answer

In [None]:
df.loc[mask, ['mag', 'title', 'tsunami', 'place']].head(5)

#### Checkin summary

We can also match more than one substring at once. `str.contains` treats its pattern as a regular expression by default, so `'CA|California'` matches rows where `place` contains either `CA` or `California`.

In [None]:
# We might have missed some-- the USGS has tagged some locations as California and some as CA.
CA_df = df.loc[
    (df.place.str.contains('CA|California')),
    ['mag', 'title', 'tsunami', 'place']
]
CA_df.head(3)

In [None]:
# if we just want the columns related to magnitude
df.loc[
    (df.place.str.contains('CA|California')),
    [col for col in df.columns if 'mag' in col]
].head(3)
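
One thing to watch for with `str.contains`: if the text column has missing values, the mask can contain missing values too, and it may then fail when used for filtering. Passing `na=False` treats missing entries as non-matches. A sketch on a made-up Series (the place strings are invented, and `place` here is a toy Series rather than the real column):

```python
import pandas as pd

# Hypothetical place strings, including one missing value
place = pd.Series([
    '10km N of Town, California',
    '5km S of City, CA',
    'Offshore Chile',
    None,
])

# na=False: a missing place counts as "does not match"
mask = place.str.contains('CA|California', na=False)

print(mask.tolist())
print(place[mask].tolist())
```

Keep in mind that the match is case-sensitive by default, so `CA` does not match `Ca` or `ca`.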

## Activity 

### Create a summary table with the magnitude `mag` and the place `place` of the smallest and the largest earthquakes in California

*Hints:*

Use the `CA_df`, the subsetted DataFrame with only Californian earthquakes.

The summary table will have two rows and two columns.

The columns should be `mag` and `place`.

The first row should contain the information for the smallest earthquake in California (lowest magnitude) and the second row should contain the information for the largest earthquake in California (highest magnitude).

#### More hints

Use `.loc` to reference and subset the columns `mag` and `place`

Since `.loc` selects rows by label, use the index labels of the min and max rows, not their integer positions!

- `df.colname.idxmin()`
- `df.colname.idxmax()`

#### Answer

In [None]:
# idxmin/idxmax return index labels, which is what .loc expects
CA_df.loc[
    [CA_df.mag.idxmin(), CA_df.mag.idxmax()],
    ['mag','place']
]

### How many earthquakes in the dataset had a red alert?

*Hint*: `alert` is a categorical feature, and `red` is one of its categories.

What method can you use to count the occurrences of each category?

#### Answer

In [None]:
df.alert.value_counts()['red']
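
Indexing the `value_counts()` result with `['red']` raises a `KeyError` if that category never appears. Two alternatives that return 0 instead are sketched below on a made-up Series; `alert` here is toy data, and `'orange'` is used only as an example of a category that is absent from it.

```python
import pandas as pd

# Hypothetical alert levels, including one missing value
alert = pd.Series(['green', 'red', None, 'green', 'red', 'yellow'])

# value_counts() skips missing values by default
counts = alert.value_counts()

# Option 1: count the True values of a boolean comparison
n_red = (alert == 'red').sum()

# Option 2: .get() with a default for categories that never occur
n_orange = counts.get('orange', 0)

print(n_red)
print(n_orange)
```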

### How many Oregon earthquakes are in the dataset?

Instead of California, let's count the Oregon earthquakes.

Include the column features: `['mag', 'title', 'tsunami', 'place']`

#### Answer

In [None]:
OR_df = df.loc[
    (df.place.str.contains('OR|Oregon')),
    ['mag', 'title', 'tsunami', 'place']
]
OR_df.shape

# Appendix

## 1. What Do I Check When I First Load Data?

| Code | What It Tells Me |
|------|------------------|
| `df.head()` | What does the data look like? |
| `df.tail()` | What do the last rows look like? |
| `df.shape` | How big is the dataset? |
| `df.info()` | Are there missing values? What are the data types? |
| `df.columns` | What variables do I have? |
| `df.dtypes` | Which columns are numeric vs categorical? |
| `df.describe()` | What are the summary statistics of numeric columns? |


## 2. Explore Different Types of Variables

| Question | Code | When to Use |
|----------|------|-------------|
| What categories exist? | `df[col_name].unique()` | Categorical data |
| What is the count of each category? | `df[col_name].value_counts()` | Categorical data |
| How many valid observations are there? | `df[col_name].count()` | Any column |
| What are the summary statistics? | `df[col_name].describe()` | Numeric data |
| What is the average value? | `df[col_name].mean()` | Numeric data |
| What is the middle value? | `df[col_name].median()` | Numeric data |
