# Python Exercises

The following assignment is intended to help you practice Python skills.

The exercises may seem trivial to some and challenging to others. If you find them challenging, please email me for help or guidance.

Throughout the following, you may find it useful to review the Supplemental Python Module for our course and/or to consult the documentation for numpy, matplotlib, and pandas, including:
* numpy: https://numpy.org/doc/1.26/
* matplotlib's `plot` options: https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.plot.html
* pandas: https://pandas.pydata.org/docs/user_guide/index.html#user-guide

In the cell below, import numpy aliased as `np`, matplotlib.pyplot aliased as `plt`, and pandas aliased as `pd`.

Create a Python list of numbers and make a simple plot of the list values using `plt.plot()`

Use numpy's `linspace` function to create a numpy array of 100 evenly spaced numbers between 0 and 2 (inclusive), and save it to the variable `x`.
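As a small illustration of how `np.linspace` behaves, here is a sketch with hypothetical endpoints and a count chosen only for readability (not the values asked for above). Note that both endpoints are included by default.

```python
import numpy as np

# Hypothetical example: 5 evenly spaced values from 0 to 1, endpoints included
demo = np.linspace(0, 1, 5)

print(demo)        # the values 0, 0.25, 0.5, 0.75, 1
print(demo.shape)  # (5,)
```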

Create a new array `y` whose elements are equal to $x + 3x^2 - 2x^3$ for each element of `x`, and then plot $y$ vs $x$ as a scatter plot with blue squares for markers.
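Arithmetic on numpy arrays is applied element by element, so a polynomial can be written almost exactly as it appears on paper. The sketch below uses a hypothetical array `t` and a made-up polynomial, not the one from the exercise.

```python
import numpy as np

# Hypothetical input array and polynomial, chosen only for illustration
t = np.array([0.0, 1.0, 2.0])

# Operators act on every element at once; ** is exponentiation
p = 1 + 2 * t - t**2

print(p)  # the values 1, 2, 1
```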

Use `np.random.normal` to create a numpy array of 100 normally distributed numbers with mean 0 and standard deviation 0.2, and save it to the variable `x_noise`

* [Documentation for np.random.normal](https://numpy.org/doc/1.26/reference/random/generated/numpy.random.normal.html)
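For orientation, `np.random.normal` takes the mean as `loc`, the standard deviation as `scale`, and the number of samples as `size`. The sketch below uses hypothetical parameter values (not the ones requested above) and an optional seed so that reruns give the same numbers.

```python
import numpy as np

np.random.seed(0)  # optional: fixes the random sequence so reruns match

# Hypothetical parameters: mean 5, standard deviation 2, 1000 samples
samples = np.random.normal(loc=5, scale=2, size=1000)

print(samples.shape)   # (1000,)
print(samples.mean())  # close to 5, but not exactly 5
```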

We'll double check that the mean is actually close to 0 and the standard deviation is close to 0.2.

Calculate the average of the values in `x_noise` by iterating over `x_noise` with a `for` loop, summing up all the values, and then dividing by 100.

Confirm that this gives the same result as calculating the average value with numpy's `average` function.
* [Documentation for np.average](https://numpy.org/doc/1.26/reference/generated/numpy.average.html)

Calculate the standard deviation of the values in `x_noise` by iterating over `x_noise` with a `for` loop, summing the squared differences between the `x_noise` values and the average value, dividing the total by 100, and taking the square root (with `np.sqrt()`).

Confirm that this gives the same result as calculating the standard deviation with numpy's `std` function.
* [Documentation for np.std](https://numpy.org/doc/1.26/reference/generated/numpy.std.html)
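The accumulate-in-a-loop pattern described above can be sketched on a small hypothetical array whose mean and standard deviation are easy to check by hand. The name `values` and its contents are made up for illustration. Dividing by the number of elements (rather than one less) matches the default behavior of `np.std`.

```python
import numpy as np

# Hypothetical data, small enough to check by hand
values = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# Accumulate the sum one element at a time, then divide by the count
total = 0.0
for v in values:
    total += v
mean = total / len(values)

# Accumulate squared deviations from the mean, divide by the count, take the root
sq_total = 0.0
for v in values:
    sq_total += (v - mean) ** 2
std = np.sqrt(sq_total / len(values))

print(mean, std)  # mean is 5 and std is 2 for this data
print(np.isclose(mean, np.average(values)), np.isclose(std, np.std(values)))
```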

Plot a histogram of `x_noise` values with 20 bins.
* [Documentation for plt.hist](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.hist.html)

In the code cell below:
* Make a new numpy array `x2` that has 100 evenly spaced values between -0.6 and 0.6
* The Normal (Gaussian) Distribution Function is given by: $$y(x) = \frac{1}{\sigma\sqrt{2\pi}}e^{-(x - \mu)^2/(2\sigma^2)}$$
* Make another numpy array `y2` that has the values given by the Normal Distribution Function for each element of `x2`, with mean $\mu = 0$ and standard deviation $\sigma = 0.2$.
  * `np.pi`, `np.sqrt()`, and `np.exp()` may all be useful
* Plot a histogram of `x_noise` values with 20 bins again, but this time include a line plot on top of it that plots `y2` vs `x2`
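To see how the Normal Distribution Function formula above translates into array operations, here is a minimal sketch that uses hypothetical parameters (a standard normal with $\mu = 0$ and $\sigma = 1$ on a made-up grid), not the values from the exercise. It also checks numerically that the area under the curve is close to 1.

```python
import numpy as np

# Hypothetical parameters (a standard normal), not the ones used in the exercise
mu, sigma = 0.0, 1.0
grid = np.linspace(-5, 5, 1001)

# Direct transcription of the formula: 1/(sigma*sqrt(2*pi)) * exp(-(x-mu)^2 / (2*sigma^2))
pdf = 1 / (sigma * np.sqrt(2 * np.pi)) * np.exp(-((grid - mu) ** 2) / (2 * sigma**2))

dx = grid[1] - grid[0]
print(pdf.max())          # peak at x = mu, about 0.3989
print(np.sum(pdf) * dx)   # Riemann-sum estimate of the area, close to 1
```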


It may look like the heights of the histogram bins are very large in comparison with the line plot.  Why?

Consult the [histogram documentation](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.hist.html) and find the parameter that will allow you to plot a normalized histogram, i.e., one for which the total area of the bars is equal to 1.

Repeat the process of plotting the histogram and line together, but now plot a normalized histogram.  (This should give a better match between the line and the histogram.)

Returning to our initial `x` and `y` arrays (with $y = x + 3x^2 - 2x^3$), we're going to make a new plot:
* Plot `y` vs `x` as a red line
* Define `y_noise` to be $x + 3x^2 - 2x^3 + x_{noise}$
* On the same figure as the line plot, plot `y_noise` vs `x` as a scatter plot with blue squares for markers
* Label the horizontal and vertical axes
* Add a title
* Change the font size of the axis labels and the title

In this plot, the scatter points should look like they are noisily distributed close to the line.

The plot above is not the result of a regression, but we can nevertheless use the data to calculate regression metrics.

In [None]:
# Execute this cell to see what scikit-learn's r2_score function gives for an R^2 score:
from sklearn.metrics import r2_score
r2_score(y_noise, y)

Write a Python function for manually calculating the $R^2$ score:
* the function should have two input parameters, `a` and `b`
  * `a` and `b` will be numpy arrays in our case
  * hypothetically `a` could be our data (like the blue squares above) and `b` could be our predictions (like the line above)
  * to explain the calculation, I'll denote the elements of `a` as [$a_0$, $a_1$, ... $a_n$] and of `b` as [$b_0$, $b_1$, ... $b_n$]
* the body of the function should return the value of $R^2$ by calculating the following
  * $R^2 = 1 - \frac{SS_{res}}{SS_{tot}}$
  * $SS_{res} = \sum_i (a_i - b_i)^2$
    * the sum of the squared differences between elements of `a` and elements of `b`
  * $SS_{tot} = \sum_i (a_i - \bar{a})^2$
    * the sum of the squared differences between elements of `a` and the average of `a`
  * $\sum_i$ represents a sum over all array elements
  * $\bar{a}$ is the average of `a`'s values
* you should not use any numpy library methods inside the function, only basic Python operations
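As a quick hand check of these formulas, consider the hypothetical arrays `a` = [1, 2, 3] and `b` = [1, 2, 4] (made up for illustration only):

$$\bar{a} = \frac{1 + 2 + 3}{3} = 2$$

$$SS_{res} = (1 - 1)^2 + (2 - 2)^2 + (3 - 4)^2 = 1$$

$$SS_{tot} = (1 - 2)^2 + (2 - 2)^2 + (3 - 2)^2 = 2$$

$$R^2 = 1 - \frac{SS_{res}}{SS_{tot}} = 1 - \frac{1}{2} = 0.5$$

A correct implementation should return 0.5 for these two inputs, which makes them a convenient test case.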

Use your new function to calculate the $R^2$ score for `y_noise` and `y`.  (You should get the same result as `sklearn.metrics.r2_score`)

# Pandas section

Execute the cell below to import the "Adult" dataset (also known as the "Census Income" dataset) from UCI's Machine Learning repository (https://archive.ics.uci.edu/dataset/2/adult).

This dataset can be used to predict whether income exceeds $50K/yr based on census data. We will use this dataset in next week's assignment as well, but here we are just doing a couple of simple exercises to get started with importing the data and working with it as a pandas dataframe.

NOTE: If you run the cell and see the error `ModuleNotFoundError: No module named 'ucimlrepo'`, then you will need to install `ucimlrepo`.
  * You can do this directly in the Jupyter notebook by adding a new cell and executing `!pip install ucimlrepo`, where the `!` allows you to execute shell commands rather than Python code

In [None]:
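from ucimlrepo import fetch_ucirepo

# fetch dataset (id=2 is the Adult dataset)
adult = fetch_ucirepo(id=2)

# data (as pandas dataframes)
# copy the features so that adding a column does not modify the fetched data
# or trigger a SettingWithCopyWarning
X = adult.data.features.copy()
X['income'] = adult.data.targets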

`X` is now a dataframe.  Use the `info` method to output summary information about it.

Use Python code to output answers to the following questions.

How many rows and columns does the dataframe have?

What are the column names of the dataframe?

Output the first 2 rows of the dataframe.

Output summary statistics about the dataframe using the `describe` method.

What are the unique values in the `age` and `race` columns?

Plot a histogram of the values in the `age` column.

Plot a bar graph that shows the number of elements for each unique value in the `sex` column.
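If the pandas operations behind these questions are unfamiliar, the sketch below shows a few of them on a hypothetical toy dataframe `df` that is unrelated to the Adult dataset. The column names and values are made up for illustration. Plotting is left out.

```python
import pandas as pd

# Hypothetical toy dataframe, unrelated to the Adult dataset
df = pd.DataFrame({
    "name": ["Ann", "Bob", "Cy", "Dee"],
    "score": [90, 85, 90, 70],
    "team": ["red", "blue", "red", "red"],
})

print(df.shape)                   # (rows, columns), here (4, 3)
print(list(df.columns))           # ['name', 'score', 'team']
print(df.head(2))                 # first 2 rows
print(df["score"].unique())       # distinct values, in order of first appearance
print(df["team"].value_counts())  # number of rows for each distinct value
```

The counts returned by `value_counts` are the kind of per-category totals that a bar graph displays.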

## Submit

* Save your work (File -> Save Notebook)
* Verify that your notebook runs without error by restarting the kernel (or closing and opening the notebook) and selecting the top menu item for Run -> Run All Cells.  It should run successfully all the way to the bottom.
* Save your notebook again.  Keep all the output visible when saving the final version.
* Submit the file through the Canvas Assignment.