In [None]:
from lec_utils import *

<div class="alert alert-info" markdown="1">

#### Lecture 7

# Exploratory Data Analysis, Data Cleaning, and Visualization

### EECS 398-003: Practical Data Science, Fall 2024

<small><a style="text-decoration: none" href="https://practicaldsc.org">practicaldsc.org</a> • <a style="text-decoration: none" href="https://github.com/practicaldsc/fa24">github.com/practicaldsc/fa24</a></small>
    
</div>

### Announcements 📣

- Homework 3 is due on **Thursday**. See [**this post on Ed**](https://edstem.org/us/courses/61012/discussion/5274722) for an important clarification.
- We've slightly adjusted the Office Hours schedule – [**take a look**](https://practicaldsc.org/calendar), and please come by.<br><small>I have office hours after lecture on both Tuesday and Thursday now!</small>

- [**study.practicaldsc.org**](https://study.practicaldsc.org) contains our discussion worksheets (and solutions), which are made up of old exam problems. Use these problems to build your theoretical understanding of the material!

### Agenda

- Merging practice problem.
- Exploratory data analysis.
- Data cleaning.
- Visualization.

<div class="alert alert-warning">
    <h3>Question 🤔 (Answer at <a style="text-decoration: none; color: #0066cc" href="https://docs.google.com/forms/d/e/1FAIpQLSd4oliiZYeNh76jWy-arfEtoAkCrVSsobZxPwxifWggo3EO0Q/viewform">practicaldsc.org/q</a>)</h3>
    
<small>Remember that you can always ask questions anonymously at the link above!</small>
    
Consider the DataFrames `div_one` (left) and `coach` (right), shown below.
    
<center><img src="imgs/merges.png" width=800></center>
    
What should we fill the blank in the `'Region'` column of `coach` with so that the DataFrame `div_one.merge(coach, on='Region')` has 9 rows?
- A. `'South'`.
- B. `'West'`.
- C. `'East'`.
- D. `'Midwest'`.

**After you've voted**, try it out yourself below.<br><small>ChatGPT doesn't even know the right answer – try asking it!</small>

In [None]:
div_one = pd.DataFrame().assign(
    Team=['Triton Circus', 'Wolverines', 'Golden Bears', 'Sooners', 'Patriots', 'Bruins'],
    Region=['Midwest', 'Midwest', 'East', 'Midwest', 'West', 'West']
)
coach = pd.DataFrame().assign(
    Team=['Triton Circus', 'Wolverines', 'Golden Bears', 'Sooners', 'Patriots', 'Bruins'],
    Coach=['Coach Jason', 'Coach Jack', 'Coach Jason', 'Coach Ashley', 'Coach Nick', 'Coach Zoe'],
    Region=['____', 'Midwest', 'Midwest', 'East', 'East', 'South'] # Test this out once you've guessed!
)
# div_one.merge(coach, on='Region').shape[0]
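
- One way to reason about questions like this: in an inner merge, each key that appears in both DataFrames contributes (number of rows with that key on the left) × (number of rows with that key on the right) rows. The sketch below checks this rule on two made-up DataFrames; `left`, `right`, and `predicted_rows` are hypothetical names chosen for illustration and are not part of the lecture code.

```python
import pandas as pd

# Hypothetical toy DataFrames, unrelated to div_one and coach.
left = pd.DataFrame({'key': ['a', 'a', 'b', 'c'], 'x': [1, 2, 3, 4]})
right = pd.DataFrame({'key': ['a', 'b', 'b', 'd'], 'y': [10, 20, 30, 40]})

def predicted_rows(left, right, on):
    """Predict the number of rows in an inner merge without performing it."""
    left_counts = left[on].value_counts()
    right_counts = right[on].value_counts()
    # Multiplication aligns on the key; keys missing from one side give NaN.
    return int((left_counts * right_counts).dropna().sum())

predicted = predicted_rows(left, right, on='key')
actual = left.merge(right, on='key').shape[0]
print(predicted, actual)
```

The same counting argument applies to `div_one` and `coach` once you pick a value for the blank.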

## Exploratory data analysis

---

### Dataset overview

- [LendingClub](https://www.lendingclub.com/) is a platform that allows individuals to borrow money – that is, take on **loans**.


- Each row of the dataset corresponds to a different loan that the LendingClub approved and paid out.<br><small>The full dataset is over 300 MB, so we've sampled a subset for this lecture.</small>

In [None]:
loans = pd.read_csv('data/loans.csv')

In [None]:
# Each time you run this cell, you'll see a different random subset of the DataFrame.
loans.sample(5)

In [None]:
# When a DataFrame has more columns than you can see in its preview,
# it's a good idea to check the names of all columns.
loans.columns

- Not all of the columns are necessarily interesting, but here are some that might be:
<br><small>FICO scores refer to credit scores.</small>

In [None]:
# Again, run this a few times to get a sense of the typical values.
loans[['loan_amnt', 'issue_d', 'term', 'int_rate', 'emp_title', 'fico_range_low']].sample(5)

### Lender decision-making

- Larger interest rates make the loan more expensive for the borrower – as a borrower, you want a lower interest rate!

- Even for the same loan amount, different borrowers were approved for different terms and interest rates:

In [None]:
display_df(loans.loc[loans['loan_amnt'] == 3600, ['loan_amnt', 'term', 'int_rate']], rows=17)

- **Why do different borrowers receive different terms and interest rates?**

### Exploratory data analysis (EDA)

- Historically, data analysis was dominated by formal statistics, including tools like confidence intervals, hypothesis tests, and statistical modeling.

- In 1977, John Tukey [defined](https://search.worldcat.org/title/3058187) the term **exploratory data analysis**, which describes a philosophy for approaching data analysis:

    > Exploratory data analysis is actively incisive, rather than passively descriptive, with real emphasis on the discovery of the unexpected.

- Practically, EDA involves, among other things, computing summary statistics and drawing plots to understand the nature of the data at hand.

    > The greatest gains from data come from surprises… The unexpected is best brought to our attention by **pictures**.

- We'll discuss specific visualization techniques towards the end of the lecture.

### Terminology

<center><img src="imgs/indiv-feat.png" width=1200></center>

- <span style="color:#6d9eeb"><b>Individual (row)</b></span>: Person/place/thing for which data is recorded. Also called an **observation**.

- <span style="color:#ff9900"><b>Feature (column)</b></span>: Something that is recorded for each individual. Also called a **variable** or **attribute**.<br><small>Here, "variable" doesn't mean Python variable!</small>

- There are two key types of features:
    - **Numerical features**: It makes sense to do arithmetic with the values.
    - **Categorical features**: Values fall into categories that may or may not have some order to them.

### Feature types

<center><img src="imgs/features.png" width=1200></center>

<div class="alert alert-warning">
    <h3>Question 🤔 (Answer at <a style="text-decoration: none; color: #0066cc" href="https://docs.google.com/forms/d/e/1FAIpQLSd4oliiZYeNh76jWy-arfEtoAkCrVSsobZxPwxifWggo3EO0Q/viewform">practicaldsc.org/q</a>)</h3>
    
<small>Remember that you can always ask questions anonymously at the link above!</small>
    
Which of these is **not** a numerical feature?
    
- A. Fuel economy in miles per gallon.
- B. Zip codes.
- C. Number of semesters at Michigan.
- D. Bank account number.
- E. More than one of these are not numerical features.

### Feature types vs. data types

- The data type `pandas` uses is not the same as the "feature type" we talked about just now!<br><small>There's a difference between feature type (categorical, numerical) and computational data type (string, int, float, etc.).</small>

- Be careful: sometimes numerical features are stored as strings, and categorical features are stored as numbers.

In [None]:
# The 'id's are stored as numbers, but are categorical (nominal).
# Are these loan 'id's (unique to each loan) or customer 'id's (which could be duplicated)?
# We'll investigate soon!
loans['id']

In [None]:
# Loan 'term's are stored as strings, but are actually numerical (discrete).
loans['term']

## Data cleaning

---

### Four pillars of data cleaning

When loading in a dataset, to clean the data – that is, to prepare it for further analysis – we will:

1. Perform **data quality checks**.

2. Identify and handle **missing values**.

3. Perform **transformations**, including converting time series data to **timestamps**.

4. Modify **structure** as necessary.

## Data cleaning: Data quality checks

---

### Data quality checks

We often start an analysis by checking the quality of the data.

- Scope: Do the data match your understanding of the population? 

- Measurements and values: Are the values reasonable?

- Relationships: Are related features in agreement?

- Analysis: Which features might be useful in a future analysis? 

### Scope: Do the data match your understanding of the population?

- Who does the data represent?

- We were told that we're only looking at approved loans. What's the distribution of `'loan_status'`es?

In [None]:
loans.columns 

In [None]:
loans['loan_status'].value_counts() 

- Are the loans in our dataset specific to any particular state?

In [None]:
loans['addr_state'].value_counts().head() 

- Were loans only given to applicants with excellent credit? Or were applicants with average credit approved for loans too?

In [None]:
loans['fico_range_low'].describe() 

### Measurements and values: Are the values reasonable?

- What's the distribution of `'loan_amnt'`s?<br><small>Run the cell below, then read [**this article**](https://www.lendingclub.com/help/personal-loan-faq/how-much-can-i-borrow) by the LendingClub.</small>

In [None]:
# It seems like no loans were above $40,000.
loans['loan_amnt'].agg(['min', 'median', 'mean', 'max']) 

- What kinds of information does the `loans` DataFrame even hold?

In [None]:
# The "object" dtype in pandas refers to anything that is not numeric/Boolean/time-related,
# including strings.
loans.info() 

- What's going on in the `'id'` column of `loans`?

In [None]:
# Are there multiple rows with the same 'id'?
# That is, are they person 'id's or loan 'id's?
loans['id'].value_counts().max() 
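
- As a small sketch of the same uniqueness check on made-up data: the `toy` DataFrame below is hypothetical and repeats one `'id'` on purpose. `is_unique` answers the yes/no question, and `duplicated(keep=False)` shows which rows share an `'id'`.

```python
import pandas as pd

# Hypothetical loan records; the 'id' 102 appears twice on purpose.
toy = pd.DataFrame({
    'id': [101, 102, 103, 102],
    'loan_amnt': [3600, 5000, 12000, 7500],
})

# True only if no 'id' is repeated.
all_unique = toy['id'].is_unique

# keep=False marks every row whose 'id' appears more than once.
repeated = toy[toy['id'].duplicated(keep=False)]
print(all_unique)
print(repeated)
```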

### Relationships: Are related features in agreement?

- Why are there two columns with credit scores, `'fico_range_low'` and `'fico_range_high'`? What do they both mean?

In [None]:
loans[['fico_range_low', 'fico_range_high']] 

In [None]:
(loans['fico_range_high'] - loans['fico_range_low']).value_counts() 

- Does every `'sub_grade'` align with its related `'grade'`?

In [None]:
loans[['grade', 'sub_grade']] 

In [None]:
# Turns out, the answer is yes!
# The .str accessor allows us to use the [0] operation
# on every string in loans['sub_grade'].
(loans['sub_grade'].str[0] == loans['grade']).all() 

## Data cleaning: Missing values

---

### Missing values

- Next, it's important to check for and handle missing values, i.e. **null** values, as they can have a big effect on your analysis.

In [None]:
# Note that very few of the 'desc' (description) values are non-null!
loans.info() 

In [None]:
# Run this repeatedly to read a random sample of loan descriptions.
for desc in loans.loc[loans['desc'].notna(), 'desc'].sample(3):
    print(desc + '\n')

- The `.isna()` Series method checks whether each element in a Series is missing (`True`) or present (`False`). `.notna()` does the opposite.

In [None]:
# The percentage of values in each column that are missing.
loans.isna().mean().sort_values(ascending=False) * 100 

- It _appears_ that applicants who submitted descriptions with their loan applications were given higher interest rates on average than those who didn't submit descriptions.<br><small>But, this could've happened for a variety of reasons, not just because they submitted a description.</small>

In [None]:
(
    loans
    .assign(submitted_description=loans['desc'].notna())
    .groupby('submitted_description')
    ['int_rate']
    .agg(['mean', 'median'])
)

- There are many ways of handling missing values, which we'll discuss more in a future lecture. But a good first step is to check how many there are!

### Aside: Series operations with null values

- The `numpy`/`pandas` null value, `np.nan`, is typically ignored when using `numpy`/`pandas` operations.

In [None]:
# Note the NaN at the very bottom.
loans['mths_since_last_delinq']

In [None]:
loans['mths_since_last_delinq'].sum() 

- But, `np.nan`s typically aren't ignored when using regular Python operations.

In [None]:
sum(loans['mths_since_last_delinq']) 

- As an aside, the regular Python null value is `None`.

In [None]:
None

In [None]:
np.nan
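
- The behavior described above can be reproduced without the `loans` data. This is a minimal sketch on a made-up Series, `values`; it also shows that `pandas` converts `None` to `np.nan` in a numeric Series, and that `np.nan` is not equal to itself.

```python
import numpy as np
import pandas as pd

# A made-up Series with one missing value.
values = pd.Series([3.0, np.nan, 5.0])

pandas_total = values.sum()              # pandas skips the NaN by default
strict_total = values.sum(skipna=False)  # NaN, since the NaN is not skipped
python_total = sum(values)               # NaN, since built-in sum doesn't skip it

# NaN never compares equal to anything, including itself,
# which is why we use .isna() instead of == np.nan.
nan_equals_itself = (np.nan == np.nan)

# In a numeric Series, None is stored as NaN.
from_none = pd.Series([1.0, None])
print(pandas_total, strict_total, python_total, nan_equals_itself)
print(from_none.isna())
```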

## Data cleaning: Transformations and timestamps

---

### Transformations

- A **transformation** results from performing some operation on every element in a sequence, e.g. a Series.


- When preparing data for analysis, we often need to:
    - Perform type conversions (e.g. changing the string `'$2.99'` to the float `2.99`).
    - Perform unit conversions (e.g. feet to meters).
    - Extract relevant information from strings.
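
- The first two kinds of transformation in the list above might look like the following sketch. The `prices` and `heights_ft` Series are made up for illustration; the conversion factor 0.3048 meters per foot is exact.

```python
import pandas as pd

# Type conversion: made-up price strings to floats.
prices = pd.Series(['$2.99', '$10.50', '$0.75'])
# regex=False treats '$' as a literal character, not a regex anchor.
prices_float = prices.str.replace('$', '', regex=False).astype(float)

# Unit conversion: made-up heights in feet to meters.
heights_ft = pd.Series([5.0, 6.0])
heights_m = heights_ft * 0.3048

print(prices_float)
print(heights_m)
```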

- For example, we can't currently use the `'term'` column to do any calculations, since its values are stored as strings (despite being numerical).

In [None]:
loans['term']

- Many of the values in `'emp_title'` are stored inconsistently, meaning they mean the same thing, but appear differently. Without further cleaning, this would make it harder to, for example, find the total number of nurses that were given loans.

In [None]:
(loans['emp_title'] == 'registered nurse').sum() 

In [None]:
(loans['emp_title'] == 'nurse').sum() 

In [None]:
(loans['emp_title'] == 'rn').sum() 
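
- One possible way to reconcile inconsistent labels is to standardize case and whitespace, then map known synonyms to a single label. The sketch below uses a made-up `titles` Series, and the synonym mapping is an assumption for illustration; real data would need a more careful mapping.

```python
import pandas as pd

# Made-up job titles with inconsistent capitalization, spacing, and wording.
titles = pd.Series(['Registered Nurse', 'RN', 'nurse ', 'Teacher', None])

# Step 1: standardize case and strip surrounding whitespace.
cleaned = titles.str.lower().str.strip()

# Step 2: map known synonyms to one label (hypothetical mapping).
cleaned = cleaned.replace({'rn': 'nurse', 'registered nurse': 'nurse'})

# Missing titles stay missing and are not counted as nurses.
n_nurses = (cleaned == 'nurse').sum()
print(cleaned)
print(n_nurses)
```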

### One solution: The `apply` method

- The Series `apply` method allows us to use a function on every element in a Series.

In [None]:
def clean_term(term_string):
    return int(term_string.split()[0])

In [None]:
loans['term'].apply(clean_term) 

- There is also an `apply` method for DataFrames, in which you can use a function on every row (if you set `axis=1`) or every column (if you set `axis=0`) of a DataFrame.

- There is also an `apply` method for DataFrameGroupBy/SeriesGroupBy objects, as you're seeing in Question 2.3 of Homework 3!
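
- To make the `axis` argument of the DataFrame `apply` method concrete, here is a minimal sketch on a made-up DataFrame, `df`. With `axis=0`, the function receives each column as a Series; with `axis=1`, it receives each row as a Series.

```python
import pandas as pd

# A made-up DataFrame with two numeric columns.
df = pd.DataFrame({'a': [1, 2, 3], 'b': [10, 20, 30]})

# axis=0: the function is called once per column.
col_ranges = df.apply(lambda col: col.max() - col.min(), axis=0)

# axis=1: the function is called once per row.
row_sums = df.apply(lambda row: row['a'] + row['b'], axis=1)

print(col_ranges)
print(row_sums)
```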

### The price of `apply`

- Unfortunately, `apply` runs really slowly!

In [None]:
%%timeit
loans['term'].apply(clean_term)

In [None]:
%%timeit
res = []
for term in loans['term']:
    res.append(clean_term(term))

- **Internally, `apply` actually just runs a `for`-loop!**

- So, when possible – say, **when applying arithmetic operations** – we should work on Series objects directly and avoid `apply`!

In [None]:
%%timeit
loans['int_rate'] // 10 * 10 # Rounds down to the nearest multiple of 10.

In [None]:
%%timeit
loans['int_rate'].apply(lambda y: y // 10 * 10)

- Above, the solution involving `apply` is ~10x slower than the one that uses direct vectorized operations.

### The `.str` accessor

- For string operations, `pandas` provides a convenient `.str` accessor.<br><small>You've seen examples of it in practice already, with `.str.contains`, and also with `.str[0]` earlier in today's lecture.</small>

- Mental model: the operations that come after `.str` are used on every single element of the Series that comes before `.str`.

In [None]:
# Here, we use .split() on every string in loans['term'].
loans['term'].str.split() 

In [None]:
loans['term'].str.split().str[0].astype(int) 

- One might expect the `.str` accessor to be quicker than `apply`, but it's actually even slower. We still use it in practice, since it allows us to write concise code.

### Creating timestamps ⏱️

- When dealing with values containing dates and times, it's good practice to convert the values to "timestamp" objects.

In [None]:
# Stored as strings.
loans['issue_d']

- To do so, we use the `pd.to_datetime` function.<br><small>It takes in a date format string; you can see examples of how they work [**here**](https://docs.python.org/3/library/datetime.html#strftime-and-strptime-behavior).</small>

In [None]:
pd.to_datetime(loans['issue_d'], format='%b-%Y') 
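
- Here is a self-contained sketch of the same conversion on made-up strings. In the format string, `%b` matches an abbreviated month name and `%Y` matches a four-digit year; this assumes an English locale. Since the strings contain no day, each timestamp falls on the first of the month.

```python
import pandas as pd

# Made-up issue dates in the same 'Mon-YYYY' style as loans['issue_d'].
issue_strings = pd.Series(['Dec-2015', 'Jan-2016', 'Mar-2016'])

issue_dates = pd.to_datetime(issue_strings, format='%b-%Y')
print(issue_dates)
```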

### Aside: The `pipe` method 🚰

- There are a few steps we've performed to clean up our dataset.
    - Convert loan `'term'`s to integers.
    - Convert loan issue dates, `'issue_d'`s, to timestamps.

- When we manipulate DataFrames, it's best to define individual functions for each step, then use the `pipe` **method** to chain them all together.<br><small>The `pipe` method takes in a function, which itself takes in a DataFrame and returns a DataFrame.</small>

In [None]:
def clean_term_column(df):
    return df.assign(
        term=df['term'].str.split().str[0].astype(int)
    )
def clean_date_column(df):
    return (
        df
        .assign(date=pd.to_datetime(df['issue_d'], format='%b-%Y'))
        .drop(columns=['issue_d'])
    )

In [None]:
loans = (
    pd.read_csv('data/loans.csv')
    .pipe(clean_term_column)
    .pipe(clean_date_column)
)
loans

In [None]:
# Same as above, just way harder to read and write.
clean_date_column(clean_term_column(pd.read_csv('data/loans.csv'))) 

### Working with timestamps

- We often want to adjust the granularity of timestamps to see overall trends, or seasonality.

- To do so, use the `resample` DataFrame method ([documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#dateoffset-objects)).<br><small>Think of it like a version of `groupby`, but for timestamps.</small>

In [None]:
# This shows us the average interest rate given out to loans in every 6 month interval.
loans.resample('6M', on='date')['int_rate'].mean() 
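
- To see how `resample` relates to `groupby`, here is a minimal sketch on a made-up DataFrame, `toy`. It uses the `'MS'` (month start) frequency rather than `'6M'` so the output is easy to check by hand. Recent versions of `pandas` prefer `'ME'` over `'M'` for month-end frequencies, so the right alias may depend on your installed version.

```python
import pandas as pd

# Made-up loans: two issued in January 2016, one in February 2016.
toy = pd.DataFrame({
    'date': pd.to_datetime(['2016-01-05', '2016-01-20', '2016-02-10']),
    'int_rate': [10.0, 12.0, 8.0],
})

# Average interest rate per calendar month, labeled by the month's first day.
monthly = toy.resample('MS', on='date')['int_rate'].mean()

# Roughly the same computation, written as a groupby on monthly periods.
same_result = toy.groupby(toy['date'].dt.to_period('M'))['int_rate'].mean()

print(monthly)
print(same_result)
```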

- We can also do arithmetic with timestamps.

In [None]:
# Not meaningful in this example, but possible.
loans['date'].diff() 

In [None]:
# If each loan was for 60 months,
# this is a Series of when they'd end.
# Unfortunately, pd.DateOffset isn't vectorized, so
# if you'd want to use a different month offset for each row
# (like we'd need to, since some loans are 36 months
# and some are 60 months), you'd need to use `.apply`.
loans['date'] + pd.DateOffset(months=60) 
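
- The comment above mentions that using a different month offset for each row requires `apply`. A minimal sketch of that idea on a made-up DataFrame, `toy`, is below; the column names mirror `loans`, but the data is hypothetical.

```python
import pandas as pd

# Made-up loans with different terms (in months).
toy = pd.DataFrame({
    'date': pd.to_datetime(['2016-01-01', '2016-03-01']),
    'term': [36, 60],
})

# Each row gets its own offset, so we apply a function row by row.
end_dates = toy.apply(
    lambda row: row['date'] + pd.DateOffset(months=int(row['term'])),
    axis=1,
)
print(end_dates)
```

As with any use of `apply`, this runs a loop under the hood, so it will be slow on large DataFrames.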

### The `.dt` accessor

- Like with Series of strings, `pandas` has a `.dt` accessor for properties of timestamps ([documentation](https://pandas.pydata.org/docs/user_guide/basics.html#basics-dt-accessors)).

In [None]:
loans['date'].dt.year

In [None]:
loans['date'].dt.month

<div class="alert alert-danger" markdown="1">

#### Reference Section

## Data cleaning: Modifying structure

---

See the posted lecture notebook for more details about another useful DataFrame method, `melt`.

### Reshaping DataFrames

We often **reshape** the DataFrame's structure to make it more convenient for analysis. For example, we can:

- Simplify structure by removing columns or taking a set of rows for a particular period of time or geographic area.<br><small>We already did this!</small>

- Adjust granularity by aggregating rows together.<br><small>To do this, use `groupby` (or `resample`, if working with timestamps).</small>

- Reshape structure, most commonly by using the DataFrame `melt` method to un-pivot a dataframe.

### Using `melt`

- The `melt` method is common enough that we'll give it a special mention.

- We'll often encounter pivot tables (esp. from government data), which we call *wide* data.

- The methods we've introduced work better with *long-form* data, or *tidy* data.

- To go from wide to long, `melt`.

<center><img src='imgs/wide-vs-long.svg' width=40%></center>

### Example usage of `melt`

In [None]:
wide_example = pd.DataFrame({
    'Year': [2001, 2002],
    'Jan': [10, 130],
    'Feb': [20, 200],
    'Mar': [30, 340]
}).set_index('Year')
wide_example

In [None]:
wide_example.melt(ignore_index=False)
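
- The melted result is often easier to work with when the new columns have descriptive names. The sketch below rebuilds the same wide DataFrame so that it runs on its own, then uses the `id_vars`, `var_name`, and `value_name` arguments of `melt`; the names `'Month'` and `'Value'` are chosen here for illustration.

```python
import pandas as pd

# Same wide data as above, rebuilt so this cell is self-contained.
wide = pd.DataFrame({
    'Year': [2001, 2002],
    'Jan': [10, 130],
    'Feb': [20, 200],
    'Mar': [30, 340],
})

# One row per (Year, Month) pair, with named columns.
long = wide.melt(id_vars='Year', var_name='Month', value_name='Value')
print(long)
```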

## Visualization

---

### Napoleon's March

> "Probably the best statistical graphic ever drawn, this map by Charles Joseph Minard portrays the losses suffered by Napoleon's army in the Russian campaign of 1812." ([source](https://www.edwardtufte.com/tufte/posters))

<center><img src="imgs/minard.jpg" width=1200></center>

### Why visualize?

- Computers are better than humans at crunching numbers, but humans are better at identifying visual patterns.

- Visualizations allow us to understand lots of data quickly – they make it easier to spot trends and **communicate** our results with others.

- There are many types of visualizations; the right choice depends on the type of data at hand.<br><small>In this class, we'll look at scatter plots, line plots, bar charts, histograms, choropleths, and boxplots.</small>

### Choosing the correct type of visualization

- The type of visualization we create depends on the types of features we're visualizing.

- We'll directly learn how to produce the **bolded** visualizations below, but the others are also options.<br><small>See more examples [**here**](https://learningds.org/ch/10/eda_feature_types.html#the-importance-of-feature-types).</small>

| Feature types | Options |
| --- | --- |
| Single categorical feature | **Bar charts**, pie charts, dot plots |
| Single numerical feature | **Histograms**, **box plots**, density curves,<br>rug plots, **violin plots**  |
| Two numerical features | **Scatter plots**, **line plots**, heat maps,<br> contour plots |
| One categorical and one numerical feature<br><small>It really depends on the nature of the features themselves!</small> | **Side-by-side** histograms, **box plots**, or **bar charts**,<br> overlaid line plots or density curves|

- Note that we use the words "plot", "chart", and "graph" to mean the same thing.

### `plotly`

- We've used `plotly` in lecture briefly, and you've even used it in Homework 1 and Homework 3, but we've never formally discussed it.

- It's a visualization library that enables **interactive** visualizations.

<center><img src="imgs/plotly.png" width=400></center>

### Using `plotly`

- We can use `plotly` through the `plotly.express` interface.
    - `plotly` is very flexible, but it can be verbose; `plotly.express` allows us to make plots quickly.
    - See the [**documentation here**](https://plotly.com/python/plotly-express) – it's very rich.<br><small>There are good examples for almost everything!</small>

In [None]:
import plotly.express as px

- Alternatively, we can use `plotly` by setting the `pandas` plotting backend to `'plotly'` and using the DataFrame/Series `plot` method.<br><small>By default, the plotting backend is `matplotlib`, which creates non-interactive visualizations.</small>

In [None]:
pd.options.plotting.backend = 'plotly'

- Now, we're going to look at several examples. Focus on _what_ is being visualized and _why_; read the notebook later for the _how_.

### Bar charts

- Bar charts are used to show:
    - The distribution of a single categorical feature, or
    - The relationship between one categorical feature and one numerical feature.

- Usage: `px.bar` (with `orientation='h'` for horizontal bars) or `df.plot(kind='bar')` / `df.plot(kind='barh')`.<br><small>`'h'` stands for "horizontal."</small>

- Example: What is the distribution of `'addr_state'`s in `loans`?

In [None]:
# Here, we're using the .plot method on loans['addr_state'], which is a Series.
# We prefer horizontal bar charts, since they're easier to read.
(
    loans['addr_state']
    .value_counts()
    .plot(kind='barh')
)

In [None]:
# A little formatting goes a long way!
(
    loans['addr_state']
    .value_counts()
    .sort_values()
    .tail(10) # Keeps the 10 most common states; the largest bar ends up on top.
    .plot(kind='barh', title='States of Residence for Successful Loan Applicants')
    .update_layout()
)

- Example: What is the average `'int_rate'` for each `'home_ownership'` status?

In [None]:
(
    loans
    .groupby('home_ownership')
    ['int_rate']
    .mean()
    .plot(kind='barh', title='Average Interest Rate by Home Ownership Status')
)

In [None]:
# The "ANY" category seems to be an outlier.
loans['home_ownership'].value_counts()

### Side-by-side bar charts

- Instead of just looking at `'int_rate'`s for different `'home_ownership'` statuses, we could also group by loan `'term'`s. As we'll see, `'term'` impacts `'int_rate'` far more than `'home_ownership'`.

In [None]:
(
    loans
    .groupby('home_ownership')
    .filter(lambda df: df.shape[0] > 1) # Gets rid of the "ANY" category.
    .groupby(['home_ownership', 'term'])
    [['int_rate']]
    .mean()
)

- A side-by-side bar chart, which we can create by setting the `color` and `barmode` arguments, makes the pattern clear:

In [None]:
# Annoyingly, the side-by-side bar chart doesn't work properly
# if the column that separates colors (here, 'term')
# isn't made up of strings.
(
    loans
    .assign(term=loans['term'].astype(str) + ' months')
    .groupby('home_ownership')
    .filter(lambda df: df.shape[0] > 1)
    .groupby(['home_ownership', 'term'])
    [['int_rate']]
    .mean()
    .reset_index()
    .plot(kind='bar', 
          y='int_rate', 
          x='home_ownership', 
          color='term', 
          barmode='group',
          title='Average Interest Rate by Home Ownership Status and Loan Term',
          width=800)
)

- **Why do longer loans have higher `'int_rate'`s on average?**

### Histograms

- The previous slide showed the **average** `'int_rate'` for different combinations of `'home_ownership'` status and `'term'`. But what if we want to visualize more about `'int_rate'`s than just their average?

- Histograms are used to show the distribution of a single numerical feature.

- Usage: `px.histogram` or `df.plot(kind='hist')`.

- Example: What is the distribution of `'int_rate'`?

In [None]:
(
    loans
    .plot(kind='hist', x='int_rate', title='Distribution of Interest Rates')
)

- With fewer bins, we see less detail (and less noise) in the shape of the distribution.<br><small>Play with the slider that appears when you run the cell below!</small>

In [None]:
from ipywidgets import interact
def hist_bins(nbins):
    (
        loans
        .plot(kind='hist', x='int_rate', nbins=nbins, title='Distribution of Interest Rates')
        .show()
    )
interact(hist_bins, nbins=(1, 51));
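
- To see what changing the number of bins does numerically, here is a small sketch that uses `np.histogram` in place of `plotly`, on made-up interest rates. Every value still lands in exactly one bin; with fewer bins, each bin is wider, so nearby values get lumped together.

```python
import numpy as np

# Made-up interest rates (in percent), for illustration only.
rates = np.array([5.3, 6.1, 7.9, 8.4, 9.9, 10.2, 11.5, 12.7, 13.1, 18.6, 24.9])

# Many narrow bins: more detail, more noise.
fine_counts, fine_edges = np.histogram(rates, bins=10)

# Few wide bins: less detail, smoother shape.
coarse_counts, coarse_edges = np.histogram(rates, bins=2)

print(fine_counts, fine_edges)
print(coarse_counts, coarse_edges)
```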

### Box plots and violin plots

- Box plots and violin plots are alternatives to histograms, in that they also are used to show the distribution of quantitative features.<br><small>Learn more about box plots [**here**](https://datatab.net/tutorial/box-plot).</small>

- The benefit to them is that they're easily stacked side-by-side to compare distributions.

- Example: What is the distribution of `'int_rate'`?

In [None]:
(
    loans
    .plot(kind='box', x='int_rate', title='Distribution of Interest Rates')
)

- Example: What is the distribution of `'int_rate'`, separately for each loan `'term'`?

In [None]:
(
    loans
    .plot(kind='box', y='int_rate', color='term', orientation='v', 
          title='Distribution of Interest Rates by Loan Term')
)

In [None]:
(
    loans
    .plot(kind='violin', y='int_rate', color='term', orientation='v', 
          title='Distribution of Interest Rates by Loan Term')
)

- Overlaid histograms can be used to show the distributions of multiple numerical features, too.

In [None]:
(
    loans
    .plot(kind='hist', x='int_rate', color='term', marginal='box', nbins=20,
          title='Distribution of Interest Rates by Loan Term')
)

### Scatter plots

- Scatter plots are used to show the relationship between two quantitative features.

- Usage: `px.scatter` or `df.plot(kind='scatter')`.

- Example: What is the relationship between `'int_rate'` and debt-to-income ratio, `'dti'`?

In [None]:
(
    loans
    .sample(200, random_state=23)
    .plot(kind='scatter', x='dti', y='int_rate', title='Interest Rate vs. Debt-to-Income Ratio')
)

- There are a multitude of ways that scatter plots can be customized. We can color points based on groups, we can resize points based on another numeric column, we can give them hover labels, etc.

In [None]:
(
    loans
    .assign(term=loans['term'].astype(str))
    .sample(200, random_state=23)
    .plot(kind='scatter', x='dti', y='int_rate', color='term',
          hover_name='id', size='loan_amnt',
          title='Interest Rate vs. Debt-to-Income Ratio')
)

### Line charts

- Line charts are used to show how one quantitative feature changes over time.

- Usage: `px.line` or `df.plot(kind='line')`.

- Example: How many loans were given out each year in our dataset?<br><small>The trend we see is likely not representative of the market in general, or even of LendingClub in general; it's just a consequence of where our dataset came from.</small>

In [None]:
(
    loans
    .assign(year=loans['date'].dt.year)
    ['year']
    .value_counts()
    .sort_index()
    .plot(kind='line', title='Number of Loans Given Per Year')
)

- Example: How has the average `'int_rate'` changed over time?

In [None]:
(
    loans
    .resample('6M', on='date')
    ['int_rate']
    .mean()
    .plot(kind='line', title='Average Interest Rate over Time')
)

- Example: How has the average `'int_rate'` changed over time, separately for 36 month and 60 month loans?

In [None]:
(
    loans
    .groupby('term')
    .resample('6M', on='date')
    ['int_rate']
    .mean()
    .reset_index()
    .plot(kind='line', x='date', y='int_rate', color='term',
          title='Average Interest Rate over Time')
)