<div class='heading'>
    <div style='float:left;'><h1>CPSC 4300/6300: Applied Data Science</h1></div>
    <img style="float: right; padding-right: 10px; width: 65px" src="https://raw.githubusercontent.com/bsethwalker/clemson-cs4300/main/images/clemson_paw.png"> </div>

## Week 2 | Homework: EDA and Visualization

**Clemson University** **Instructor(s):** Tim Ransom

----------------------
## Learning goals

- Identify trends and patterns in data using visualization techniques.
- Interpret descriptive statistics for a dataset.
- Create effective data visualizations using matplotlib.
- Evaluate the effectiveness of different visualization methods.
- Clean and prepare data for analysis.

## INSTRUCTIONS

-   Restart the kernel and run the whole notebook again before you
    submit.
-   As much as possible, stick to the hints and to the functions we
    import at the top of the homework; those are the ideas and tools
    the class supports and aims to teach. If a problem specifies a
    particular library, you are required to use that library, possibly
    together with others from the import list.
-   Please use `.head()` when viewing data.

## About

This exercise relates to the College data set, which can be found in the
file
[College.csv](http://faculty.marshall.usc.edu/gareth-james/ISL/College.csv).
It contains a number of variables for 777 different universities and
colleges in the US. The variables are

-   `Private`: Public/private indicator
-   `Apps`: Number of applications received
-   `Accept`: Number of applicants accepted
-   `Enroll`: Number of new students enrolled
-   `Top10perc`: New students from top 10% of high school class
-   `Top25perc`: New students from top 25% of high school class
-   `F.Undergrad`: Number of full-time undergraduates
-   `P.Undergrad`: Number of part-time undergraduates
-   `Outstate`: Out-of-state tuition
-   `Room.Board`: Room and board costs
-   `Books`: Estimated book costs
-   `Personal`: Estimated personal spending
-   `PhD`: Percent of faculty with Ph.D.'s
-   `Terminal`: Percent of faculty with terminal degree
-   `S.F.Ratio`: Student/faculty ratio
-   `perc.alumni`: Percent of alumni who donate
-   `Expend`: Instructional expenditure per student
-   `Grad.Rate`: Graduation rate

In [None]:
""" RUN THIS CELL TO GET THE RIGHT FORMATTING """
import requests
from IPython.display import HTML  # public import path; IPython.core.display is deprecated for this
css_file = 'https://raw.githubusercontent.com/bsethwalker/clemson-cs4300/main/css/cpsc6300.css'
styles = requests.get(css_file).text
HTML(styles)

In [None]:
import pandas as pd 
import numpy as np
from pandas.plotting import scatter_matrix 
from matplotcheck.base import PlotTester
from matplotlib.patches import PathPatch
import matplotlib.pyplot as plt 
import seaborn as sns
sns.set()
%matplotlib inline 

<div class='exercise'> <b> Exercise 1: Load Data set</b> </div>

- Using Pandas, load the `College.csv` file from 'data/College.csv' into a DataFrame named `college`. 
- Take a look at the loaded data.
- Rename the first column as `Name` and display the first few rows of the DataFrame.
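
The load-and-rename step can be sketched on a tiny in-memory sample. This is only an illustration: the two rows below are made up, and the sketch assumes the first header cell of the CSV is blank, which is why pandas assigns it a placeholder label.

```python
import io

import pandas as pd

# Hypothetical two-row sample (made-up values); the real file has 777 rows.
sample_csv = io.StringIO(
    ",Private,Apps,Accept\n"
    "Example College A,Yes,1000,500\n"
    "Example College B,No,2000,500\n"
)
sample = pd.read_csv(sample_csv)

# pandas labels a column with a blank header "Unnamed: <position>".
first_label = sample.columns[0]
print(first_label)

# Rename by position, so the code does not depend on the placeholder text.
sample = sample.rename(columns={sample.columns[0]: "Name"})
print(sample.head())
```

Renaming via `sample.columns[0]` rather than a hard-coded label keeps the step working even if the placeholder name differs.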

In [None]:
"""Write your code for exercise-1 here:"""

college = pd.read_csv('data/College.csv')
college = college.rename(columns={college.columns[0]: 'Name'})
college.head()

Notice that the table output by the Jupyter notebook above is rendered with additional formatting. When we ask you to "report" a DataFrame in this class, make the variable you are reporting the last line of the code cell so that it gets this formatting.

In [None]:
# this will render good (enough) looking formatting for the data
college

In [None]:
# this will print plain text
print(college)

<div class='exercise'> <b> Exercise 2: Elite Institutions</b> </div>

- Create a new qualitative variable, called `Elite`, by binning the `Top10perc` variable.
- Divide the universities into two groups based on whether the percentage of students coming from the top 10% of their high school classes exceeds 50%.
  - The `Elite` column should contain only the values `Yes` or `No`.
  - `Elite` should be `Yes` if the percentage of students coming from the top 10% of their high school classes is greater than 50% (i.e., `Top10perc > 50`).
  - Otherwise, `Elite` should be `No`.
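
One possible way to express the binning rule is `pd.cut`, sketched here on a made-up four-row frame (the values are illustrative, not taken from the data set). This is an alternative to a plain comparison, not the required approach.

```python
import numpy as np
import pandas as pd

# Toy values chosen to sit on both sides of the 50% threshold.
toy = pd.DataFrame({"Top10perc": [12, 50, 51, 96]})

# Bins are right-closed by default: (-inf, 50] -> "No", (50, inf] -> "Yes",
# so a value of exactly 50 is not elite.
toy["Elite"] = pd.cut(
    toy["Top10perc"],
    bins=[-np.inf, 50, np.inf],
    labels=["No", "Yes"],
)
print(toy)

# The same rule written as a direct comparison gives identical labels.
agree = list(np.where(toy["Top10perc"] > 50, "Yes", "No")) == toy["Elite"].tolist()
print(agree)
```

`pd.cut` returns a categorical column, which is convenient when more than two bins are needed; for a single threshold, the direct comparison is shorter.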

In [None]:
"""Write your code for exercise-2 here:"""

college['Elite'] = np.where(college['Top10perc'] > 50, 'Yes', 'No')

<div class='exercise'> <b> Exercise 3: Acceptance Rates</b> </div>

- Create a new column called `AcceptRate` that contains the acceptance rate for each university.
- Calculate the acceptance rate using the following formula:
    - $\text{AcceptRate} = \dfrac{\text{Accept}}{\text{Apps}} \times 100$

In [None]:
"""Write your code for exercise-3 here:"""

college['AcceptRate'] = (college['Accept'] / college['Apps']) * 100

<div class='exercise'> <b> Exercise 4:</b> </div>

- How many elite schools are there?
- Extract the number of elite schools from our dataset and store it in a variable named `num_elite_schools`.

In [None]:
"""Write your code for exercise-4 here:"""

num_elite_schools = (college['Elite'] == 'Yes').sum()

<div class='exercise'> <b> Exercise 5: Acceptance Rate Comparison</b> </div>

- Create a boxplot comparing the acceptance rates of elite and non-elite universities. 
   1. To create the box plot, you need to set up the Figure and Axes objects using `plt.subplots()`. 
        - These two objects (fig and ax) will allow you to control various properties of the figure and plot.
        - `fig` is the Figure object: It serves as the overall container for the plot.
        - `ax` is the Axes object: This is where the actual data points will be plotted, including x and y axes, labels, etc.
        ```python
            # Example code:
            fig, ax = plt.subplots(figsize=(10, 6))
        ```
   2. Customize the Plot: 
        - You need to set the title of the plot, the labels for the x-axis and y-axis, and format the ticks on the x-axis for better readability.
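
The two steps above can be sketched with matplotlib alone on synthetic data. The group sizes, means, and labels below are invented for illustration; the point is only how `fig`, `ax`, and the tick formatting fit together.

```python
import matplotlib.pyplot as plt
import numpy as np

# Synthetic acceptance rates for two hypothetical groups (not real data).
rng = np.random.default_rng(0)
group_yes = rng.normal(45, 10, 50)
group_no = rng.normal(78, 12, 200)

# Step 1: one Figure holding one Axes.
fig, ax = plt.subplots(figsize=(10, 6))
ax.boxplot([group_yes, group_no])  # boxes are drawn at x positions 1 and 2

# Step 2: title, axis labels, and readable x-axis ticks.
ax.set_xticks([1, 2])
ax.set_xticklabels(["Yes", "No"])
ax.set_title("Acceptance Rate by Elite Status (synthetic data)")
ax.set_xlabel("Elite")
ax.set_ylabel("Acceptance Rate (%)")
```

In a notebook with `%matplotlib inline`, the figure is displayed when the cell finishes running.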

In [None]:
"""Write your code for exercise-5 here:"""

fig, ax = plt.subplots(figsize=(10, 6))
sns.boxplot(x='Elite', y='AcceptRate', data=college, ax=ax)
ax.set_title('Acceptance Rate by Elite Status')
ax.set_xlabel('Elite')
ax.set_ylabel('Acceptance Rate (%)')
plt.show()

<div class='exercise'> <b> Exercise 6: Cost Comparisons</b> </div>

- Create two side-by-side histograms (using subplots) showing the distribution of out-of-state tuition for elite and non-elite institutions. 

    1. To create the plot, you need to set up the Figure and Axes objects using `plt.subplots()`. 
        - These two objects (fig and ax) will allow you to control various properties of the figure and plot.
        - `fig` is the Figure object: It serves as the overall container for the plot.
        - `ax` is the Axes object: This is where the actual data points will be plotted, including x and y axes, labels, etc.
        ```python
            # Example code:
            fig, ax = plt.subplots(1, 2, figsize=(10, 6))
        ```
        - Here `plt.subplots(1, 2)` creates one row and two columns of subplots. This results in two individual Axes that will be side by side.
        
    2. Customize the Plot: 
        - You need to set the title of the plot, the labels for the x-axis and y-axis, and format the ticks on the x-axis for better readability.
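
A minimal sketch of the side-by-side layout, using synthetic values rather than the tuition column. Passing the same bin edges to both panels is a design choice made here (not a requirement of the exercise) so that bar positions and widths are directly comparable.

```python
import matplotlib.pyplot as plt
import numpy as np

# Synthetic values for two hypothetical groups of different sizes.
rng = np.random.default_rng(1)
small_group = rng.normal(16000, 3000, 80)
large_group = rng.normal(10000, 3500, 700)

# Common bin edges spanning both groups.
combined = np.concatenate([small_group, large_group])
bin_edges = np.linspace(combined.min(), combined.max(), 21)  # 20 bins

# One row, two columns: `axes` is a 1-D array holding two Axes objects.
fig, axes = plt.subplots(1, 2, figsize=(10, 6), sharex=True)
print(axes.shape)  # (2,)

counts_small, _, _ = axes[0].hist(small_group, bins=bin_edges, edgecolor="black")
counts_large, _, _ = axes[1].hist(large_group, bins=bin_edges, edgecolor="black")

for ax, title in zip(axes, ["Smaller group", "Larger group"]):
    ax.set_title(title)
    ax.set_xlabel("Value")
axes[0].set_ylabel("Count")
fig.tight_layout()
```

Each panel is addressed by index (`axes[0]`, `axes[1]`), which is why the title and label calls are made per Axes rather than through `plt`.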


In [None]:
"""Write your code for exercise-6 here:"""
fig, ax = plt.subplots(1, 2, figsize=(10, 6), sharey=True)
elite = college[college['Elite'] == 'Yes']
non_elite = college[college['Elite'] == 'No']
ax[0].hist(elite['Outstate'], bins=50, color='orange', edgecolor='black')
ax[0].set_title('Elite Institutions')
ax[0].set_xlabel('Out-of-State Tuition')
ax[0].set_ylabel('Count')
ax[1].hist(non_elite['Outstate'], bins=50, color='purple', edgecolor='black')
ax[1].set_title('Non-Elite Institutions')
ax[1].set_xlabel('Out-of-State Tuition')
plt.tight_layout()
plt.show()

<div class='exercise'> <b> Exercise 7:</b></div>

- Which university has the highest percentage of new students from the top 10% of their high school class?
- Extract the name of that university and store it in a new variable named `top_university`.
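
Lookups of this kind ("which row has the largest or smallest value?") can be sketched with `idxmax` and `idxmin` on a made-up three-row frame. The school names and numbers are hypothetical.

```python
import pandas as pd

# Hypothetical schools with made-up values.
toy = pd.DataFrame(
    {
        "Name": ["School A", "School B", "School C"],
        "Top10perc": [30, 96, 45],
    }
)

# idxmax/idxmin return the index label of the first row holding the extreme value.
max_label = toy["Top10perc"].idxmax()
min_label = toy["Top10perc"].idxmin()

# Use the label with .loc to read another column from the same row.
highest = toy.loc[max_label, "Name"]
lowest = toy.loc[min_label, "Name"]
print(highest, lowest)
```

If several rows tie for the extreme value, only the first one is returned, so a tie would need a boolean filter instead.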

In [None]:
"""Write your code for exercise-7 here:"""
top_university = college.loc[college['Top10perc'].idxmax(), 'Name']
# print(top_university)

<div class='exercise'> <b> Exercise 8:</b> </div>

- Which university has the smallest acceptance rate?
- Extract the name of the university that has the smallest acceptance rate and store it in a new variable named `university_smallest_accept_rate`.

In [None]:
"""Write your code for exercise-8 here:"""
university_smallest_accept_rate = college.loc[college['AcceptRate'].idxmin(), 'Name']
# print(university_smallest_accept_rate)

<div class='exercise'> <b> Exercise 9:</b> </div>

- Which university has the most liberal (i.e., highest) acceptance rate?
- Extract the name of the university that has the highest acceptance rate and store it in a new variable named `university_most_liberal_accept_rate`.

In [None]:
"""Write your code for exercise-9 here:"""

university_most_liberal_accept_rate = college.loc[college['AcceptRate'].idxmax(), 'Name']

<div class='exercise'> <b> Exercise 10:</b> </div>

- Calculate the correlation between out-of-state tuition and graduation rate and store it in a variable named `correlation`.
- Refer to this document [pandas.DataFrame.corr](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.corr.html) on how to calculate correlation value using pandas .corr().
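
As a rough illustration of the two ways pandas exposes correlation, the sketch below uses five made-up rows. `Series.corr` returns a single number for one pair of columns, while `DataFrame.corr` returns the full pairwise matrix; both default to Pearson correlation.

```python
import pandas as pd

# Made-up values; only the column names mirror the data set.
toy = pd.DataFrame(
    {
        "Outstate": [7000, 9000, 12000, 15000, 20000],
        "Grad.Rate": [55, 60, 58, 72, 85],
    }
)

# One pair of columns -> one float.
r_pair = toy["Outstate"].corr(toy["Grad.Rate"])

# Whole frame -> symmetric matrix with 1.0 on the diagonal.
r_matrix = toy[["Outstate", "Grad.Rate"]].corr()

print(r_pair)
print(r_matrix)
```

The off-diagonal entry of the matrix is the same number that `Series.corr` returns, so either form can be used to fill a scalar variable.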

In [None]:
"""Write your code for exercise-10 here:"""

correlation = college['Outstate'].corr(college['Grad.Rate'])

<div class='exercise'> <b> Exercise 11:</b> </div>

- Calculate Clemson University's acceptance rate and store it in a variable named `clemson_accept_rate`.
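
Selecting one school's value by name can be sketched as follows, again on hypothetical rows. A boolean mask returns a Series (possibly empty), so the sketch checks for a match before pulling out the scalar; the guard is an addition for illustration.

```python
import pandas as pd

# Hypothetical schools with made-up acceptance rates.
toy = pd.DataFrame(
    {
        "Name": ["Example College A", "Example College B"],
        "AcceptRate": [50.0, 25.0],
    }
)

# The mask keeps only rows whose Name matches exactly.
matches = toy.loc[toy["Name"] == "Example College B", "AcceptRate"]

# Guard against a misspelled or missing name before indexing.
if matches.empty:
    raise LookupError("No row with that name")

rate = matches.iloc[0]  # first (here, only) matching value as a scalar
print(rate)
```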

In [None]:
"""Write your code for exercise-11 here:"""

clemson_accept_rate = college.loc[college['Name'] == 'Clemson University', 'AcceptRate'].iloc[0]

# END