# Reading Files and Split-Apply-Combine

This lesson reviews the basics of `pandas` and extends them to more advanced munging and cleaning. Specifically, we will discuss how to load data files, work with missing values, use split-apply-combine, use string methods, and work with datetime objects. By the end of this lesson you should feel confident doing basic exploratory data analysis using `pandas`.

**OBJECTIVES**

- Read local files in as `DataFrame` objects
- Drop missing values
- Replace missing values
- Impute missing values
- Use `.groupby` 
- Use built-in `.dt` methods
- Convert columns to a datetime datatype with `pd.to_datetime`
- Work with `datetime` objects in pandas


## Reading Local Files

To read in a local file, we need to pay close attention to our *working directory*: the current location of your work environment within your computer's larger filesystem. To find your working directory you can use the `os` library or, if your system executes UNIX commands, commands such as `pwd` and `ls`.

- All files for today live [here](https://github.com/jfkoehler/nyu_bootcamp_spr26/tree/main/data)
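As a minimal sketch of the directory steps, the snippet below checks the working directory, creates a folder, and builds a file path with the `os` library. It assumes the folder is named `data` and sits inside the working directory; adjust the names if your layout differs.

```python
import os

# Where is Python currently running?
cwd = os.getcwd()
print(cwd)

# Create a data folder inside the working directory.
# exist_ok=True avoids an error if the folder is already there.
os.makedirs("data", exist_ok=True)

# See what the folder currently holds.
print(os.listdir("data"))

# Build the path to ufo.csv in a way that works on any operating system.
ufo_path = os.path.join("data", "ufo.csv")
print(ufo_path)

# True only once ufo.csv has actually been saved into the folder.
print(os.path.exists(ufo_path))
```

Using `os.path.join` rather than typing the separator by hand keeps the same code working on Windows, macOS, and Linux.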

In [None]:
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
#pip install seaborn

In [None]:
#check working directory
os.getcwd()

In [None]:
#list all files in directory
os.listdir()

In [None]:
#create a data directory/folder


In [None]:
#what's in the data folder?


In [None]:
#what is the path to ufo.csv?


##### `read_csv`

Now, using the path to the `ufo.csv` file, you can create a DataFrame by passing this filepath to the `read_csv` function.
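The sketch below shows the general pattern on a tiny stand-in file, since the contents of `ufo.csv` are not reproduced here. The column names and values are hypothetical and the real file may look different; only the `pandas` calls are the point.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for ufo.csv; the real columns may differ.
csv_text = """City,Shape,State,Duration
Ithaca,TRIANGLE,NY,5.0
Willingboro,OTHER,NJ,2.5
Holyoke,OVAL,CO,
"""

with tempfile.TemporaryDirectory() as tmp:
    # Write the sample text to a real file, then read it by its path.
    path = os.path.join(tmp, "ufo_sample.csv")
    with open(path, "w") as f:
        f.write(csv_text)
    sample = pd.read_csv(path)

print(sample.head(2))                      # first 2 rows
sample.info()                              # dtypes and non-null counts
print(sample.describe())                   # numerical summaries
print(sample.describe(exclude="number"))   # non-numeric (categorical) summaries
print(sample.describe(include="all"))      # all columns at once
```

The empty last field in the third row is read in as a missing value, which is why `info()` reports fewer non-null entries for that column.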

In [None]:
#read in ufo data


In [None]:
# look at first 2 rows


In [None]:
# high level information


In [None]:
# numerical summaries


In [None]:
# categorical summaries


In [None]:
# all summaries


### Missing Values

Missing values are a common problem in data. Sometimes the values are truly missing; other times the way missing values are encoded in the file does not match what the function used to read the data expects.
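To make the options concrete, here is a small sketch on made-up data (the column names and the `?` placeholder are assumptions for illustration, not taken from the lesson files). It counts, drops, fills, and imputes missing values, and shows how `na_values` can tell `read_csv` about a custom missing-value code.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical data with one missing value in each column.
df = pd.DataFrame({
    "shape": ["DISK", None, "LIGHT", "DISK"],
    "duration": [5.0, 2.0, np.nan, 3.0],
})

# Count missing values per column.
missing_counts = df.isna().sum()
print(missing_counts)

# Drop any row that contains a missing value.
dropped = df.dropna()

# Replace missing values with fixed placeholders, column by column.
filled = df.fillna({"shape": "UNKNOWN", "duration": 0})

# Impute: most common value for text, mean for numbers.
imputed = df.copy()
imputed["shape"] = imputed["shape"].fillna(imputed["shape"].mode()[0])
imputed["duration"] = imputed["duration"].fillna(imputed["duration"].mean())

# A file that encodes missing values as "?" needs na_values to be read correctly.
raw = "shape,duration\nDISK,5\n?,2\nLIGHT,?\n"
encoded = pd.read_csv(io.StringIO(raw), na_values=["?"])
print(encoded.isna().sum())
```

Which option is appropriate depends on the question being asked; dropping rows is simplest but discards information that imputation tries to keep.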

In [None]:
# re-examine ufo info


In [None]:
# one-liner to count missing values


In [None]:
# drop missing values


In [None]:
# fill missing values


In [None]:
# replace missing values with most common value


#### Problem

1. Read in the dataset `churn_missing.csv` from the data folder and assign it to the variable `churn` below.

2. Are there any missing values?  What columns are they in and how many are there?

3. What do you think we should do about these?  Drop, replace, impute?

### `groupby`

Often you want summaries within groups of a dataset, where the groups are defined by some condition. The simplest condition is a unique value in a single column. Using `.groupby` you can split your data into groups, one per unique value, and summarize each group.

**NOTE**: After splitting you need to summarize!

![](https://www.oreilly.com/api/v2/epubs/9781783985128/files/graphics/5128OS_09_01.jpg)
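The split, apply, and combine steps can be sketched on a tiny table. The column names mirror the titanic dataset, but the rows are made up for illustration, so the numbers say nothing about the real data.

```python
import pandas as pd

# Hypothetical rows; only the column names follow the titanic dataset.
df = pd.DataFrame({
    "sex": ["male", "female", "female", "male", "female", "male"],
    "embark_town": ["Southampton", "Cherbourg", "Southampton",
                    "Southampton", "Cherbourg", "Cherbourg"],
    "survived": [0, 1, 1, 0, 1, 1],
    "age": [22, 38, 26, 35, 58, 19],
})

# Split on one column, then summarize: survival rate by sex.
by_sex = df.groupby("sex")["survived"].mean()
print(by_sex)

# Split on two columns: the result is indexed by a MultiIndex.
by_both = df.groupby(["embark_town", "sex"])["survived"].mean()
print(by_both)

# Select one group from the MultiIndex with a tuple.
print(by_both.loc[("Southampton", "male")])

# Pivot the inner index level into columns for a wider table.
wide = by_both.unstack()
print(wide)

# Filter first, then group: survival rate for passengers under 40.
under_40 = df[df["age"] < 40].groupby("sex")["survived"].mean()
print(under_40)
```

Without the final `.mean()` the result is only a `GroupBy` object; the summary step is what produces the numbers.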

In [None]:
# sample data
titanic = sns.load_dataset('titanic')
titanic.head()

In [None]:
# male vs. female


In [None]:
# departure location


In [None]:
# survival by sex for each departure


In [None]:
# working with multi-index


In [None]:
# age less than 40 survival rate



#### Problems

In [None]:
tips = sns.load_dataset('tips')

In [None]:
tips.head(2)

1. Average tip for smokers vs. non-smokers.

2. Average bill by day and time.

3. What is another question `groupby` can help us answer here?

4. What does the `as_index` argument do?  Demonstrate an example.

### Plotting from a `DataFrame`

Next class we will introduce two plotting libraries, `matplotlib` and `seaborn`. It turns out that `pandas` already wraps a good bit of `matplotlib` functionality, so plots can be created directly from a `DataFrame` with methods such as `.plot`, `.hist`, and `.boxplot`.
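One way to see the connection is that `.plot` hands back a `matplotlib` Axes object, which can then be adjusted with ordinary `matplotlib` calls. The sketch below uses made-up monthly values rather than the unemployment data.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical monthly values, not the real unemployment series.
toy = pd.DataFrame(
    {"value": [4.0, 4.2, 4.1, 4.5, 4.8, 4.6]},
    index=pd.date_range("2020-01-01", periods=6, freq="MS"),
)

# DataFrame.plot draws a line plot by default and returns a matplotlib Axes.
ax = toy.plot(title="Toy series", grid=True)

# Because ax is a matplotlib object, matplotlib methods work on it.
ax.set_ylabel("value")
print(type(ax))
```

Keeping a reference to the returned Axes is optional, but it is handy when a plot needs labels or other tweaks that the `.plot` arguments do not cover.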

In [None]:
url = 'https://raw.githubusercontent.com/evorition/astsadata/refs/heads/main/astsadata/data/UnempRate.csv'

In [None]:
unemp = pd.read_csv(url)

In [None]:
#default plot is line
unemp.plot()

In [None]:
unemp.head()

In [None]:
unemp = pd.read_csv(url, index_col = 0)
unemp.head()

In [None]:
unemp.info()

In [None]:
unemp.plot()

In [None]:
unemp.hist()

In [None]:
unemp.boxplot()

In [None]:
#create a new column of shifted measurements
unemp['shifted'] = unemp['value'].shift()

In [None]:
unemp.plot()

In [None]:
unemp.plot(x = 'value', y = 'shifted', kind = 'scatter')

In [None]:
unemp.plot(x = 'value', y = 'shifted', kind = 'scatter', title = 'Unemployment Data', grid = True);

More with `pandas` and plotting [here](https://pandas.pydata.org/docs/user_guide/visualization.html).

#### Exit Ticket

Please complete the exit ticket [here](https://forms.gle/HiJPUKLWzyTWXvGs5).