# <img style="float: left; padding-right: 10px; width: 45px" src="https://raw.githubusercontent.com/Harvard-IACS/2018-CS109A/master/content/styles/iacs.png"> CS109A Introduction to Data Science 

## Homework 0: Find the mystery number
### <span style='background :yellow' >(UPDATED 08/20/20 @ 6pm EST)</span>

**Harvard University**<br/>
**Fall 2020**<br/>
**Instructors**: Pavlos Protopapas, Kevin Rader, and Chris Tanner


<hr style='height:2px'>

---

In [None]:
## RUN THIS CELL TO PROPERLY HIGHLIGHT THE EXERCISES
import requests
from IPython.core.display import HTML
styles = requests.get("https://raw.githubusercontent.com/Harvard-IACS/2019-CS109B/master/content/styles/cs109.css").text
HTML(styles)

## Welcome to CS109a!

This homework is meant to assess your level of comfort with the prerequisites for this class. If you find the following exercises hard, whether the coding ones or the math ones, you might want to wait another year before taking CS109a.  Another goal of this homework is for you to set up the necessary programming environment.

Also check [Preparing for this course](https://harvard-iacs.github.io/2020-CS109A/pages/preparation.html) for more details on how to better prepare yourself for CS109A. 

In this homework, you are looking for a mystery number! It will be calculated by starting with a value of 0 and adding the results of calculations as you go along.



In [None]:
# initialize mystery 
mystery = 0

<div class='exercise'> <b> Exercise 1: Python </b></div>

Solid programming experience, not necessarily in Python, is a prerequisite for this class. There will not be enough time to learn programming as well as the content of this class once the semester starts. Students who have attempted to do both at once reported feeling overwhelmed, and it can drastically reduce your enjoyment of learning the core concepts of the course. If you have programming experience, but not with Python, you will need to pick up the basics of Python on your own before the start of the semester. 

#### One good starting point is the classic [Python Tutorial](https://docs.python.org/3/tutorial/).

### 1.1 Accessing the class material

All class material, except homework, will be in the class GitHub repository. Clone this repository and then copy the contents into a different directory so you can make changes.

* Open a terminal on your computer and go to the directory where you want to clone the repo. Then run 
```
git clone https://github.com/Harvard-IACS/2020-CS109A.git
```
* If you have already cloned the repo, go inside the '2020-CS109A/' directory and run 
```
git pull
```

* If you change the notebooks and then run ```git pull``` you will get a "merge conflict" error and the pull will fail.

One way to deal with this is to create a `playground/` folder and copy the folder containing the notebook you want to work with into it.

A quick tutorial on git can be found [here](https://github.com/NREL/SAM/wiki/Basic-git-tutorial).

### 1.2 Virtual environments and `.yml` files.

Before you do any installation, **create a virtual environment.** We cannot stress this enough!
 
Isolating your projects inside specific environments helps you manage software dependencies, allowing you to have different versions of packages in each environment. This is much easier than having to make sure all installed packages are compatible across all your projects. It also lets you recover from serious mess-ups by deleting an environment, without having to worry about how that affects your entire system.
 
The two most popular tools for setting up environments are:

- `conda` (a package and environment manager) 
- `pip` (a Python package manager) with `virtualenv` (a tool for creating environments)

We recommend using `conda` package installation and environments. `conda` installs packages from the Anaconda Repository and Anaconda Cloud, whereas `pip` installs packages from PyPI. Even if you are using `conda` as your primary package installer and are inside a `conda` environment, you can still use `pip install` for those rare packages that are not included in the `conda` ecosystem. 

See here for more details on how to manage [Conda Environments](https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html).

#### What is a `.yml` file? 

It's a file that lists all the necessary libraries we will need in this class bundled together. See exercise below on how to create an environment using a `.yml` file.

### 1.3  Getting started!

In 1.1, you should have cloned the 2020-CS109A git repository to your local machine.

Now, let's use the `cs109a.yml` file to create a virtual environment:

``` 
$ cd /2020-CS109A/content/HW0/
$ conda env create -f cs109a.yml
$ conda activate cs109a
```   
The `cs109a.yml` file includes all the packages that you will need for the course. It should be in the same directory as this notebook. 
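
Once the environment is activated, it can be useful to confirm that Python is really running from it. The sketch below is one possible check, not part of the course setup: `describe_environment` is a hypothetical helper name, and it relies on the `CONDA_DEFAULT_ENV` variable that `conda activate` sets in the shell.

```python
import os
import sys


def describe_environment():
    """Return the interpreter path, Python version, and active conda environment name."""
    return {
        "python_executable": sys.executable,
        "python_version": sys.version.split()[0],
        # conda sets CONDA_DEFAULT_ENV when an environment is activated;
        # outside of conda the variable is absent, hence the None default.
        "conda_env": os.environ.get("CONDA_DEFAULT_ENV"),
    }


info = describe_environment()
for key, value in info.items():
    print(f"{key}: {value}")
```

If the setup worked, `conda_env` should read `cs109a` and the interpreter path should point inside that environment's directory.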

### 1.4  Importing and checking for all the necessary libraries: 

#### Besides being open source, Python is very flexible and many highly specialized libraries have been written for it (which you can import and use):

- One such library is **numpy**: [https://numpy.org](https://numpy.org). We will cover it in class but you are encouraged to look at the documentation to learn the basics. 

- Another very important library is **matplotlib**: [https://matplotlib.org](https://matplotlib.org), and its sister `seaborn`: [https://seaborn.pydata.org](https://seaborn.pydata.org). 

- We will be using Jupyter notebooks: [https://jupyter.org](https://jupyter.org). Jupyter notebooks work nicely within the JupyterLab environment (for more details see the link provided).

- Data wrangling: **pandas**: https://pandas.pydata.org

- Machine learning algorithms: **sklearn** - [https://scikit-learn.org/stable/](https://scikit-learn.org/stable/)

- Scientific formulas: **scipy** - [https://www.scipy.org](https://www.scipy.org)

- Packages for statistical analysis: **statsmodels** - [https://www.statsmodels.org/](https://www.statsmodels.org/)<br>
 statsmodels examples: https://www.statsmodels.org/stable/examples/index.html#regression<BR>

For the neural networks part of the course, we will use TensorFlow and Keras:

- TensorFlow (https://www.tensorflow.org) is a framework for representing complicated ML algorithms and executing them in any platform, from a phone to a distributed system using GPUs. Developed by Google Brain, TensorFlow is used very broadly today. 

- Keras (https://keras.io/) is a high-level API used for fast prototyping, advanced research, and production. We will use `tf.keras` which is TensorFlow's implementation of the `keras` API.


**Note**: The `.yml` file we provided installs all of these libraries for you.
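
Before running the import cells, a quick way to see whether the libraries listed above are available is to ask Python to locate each one without importing it. This is a minimal sketch: the `REQUIRED` list is an assumption based on the libraries named in this section (the authoritative list is whatever `cs109a.yml` contains), and `missing_packages` is a hypothetical helper.

```python
from importlib.util import find_spec

# Import names of the libraries listed above (assumed, not read from cs109a.yml).
REQUIRED = ["numpy", "scipy", "pandas", "matplotlib", "seaborn",
            "sklearn", "statsmodels", "tensorflow"]


def missing_packages(names):
    """Return the top-level import names that cannot be found."""
    return [name for name in names if find_spec(name) is None]


missing = missing_packages(REQUIRED)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages were found.")
```

Note that scikit-learn is imported as `sklearn`, which is why that name appears in the list.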

<div class='exercise'> <b> Exercise 2:  Run the following cells of code: </b></div> 

In [None]:
# See the "import ... as ..." constructs below: 
# they're aliases/shortcuts for the package names. As a result, 
# we can call methods such as plt.plot() instead of matplotlib.pyplot.plot()

import random
import numpy as np
import scipy as sp
import pandas as pd
import scipy.stats
import matplotlib.pyplot as plt

# The line starting with % is a jupyter "magic" command, 
# and is not part of the Python language.
# In this case we're just telling the plotting library to draw things on
# the notebook, instead of on a separate window.

%matplotlib inline

In [None]:
from sklearn import datasets
iris = datasets.load_iris()
digits = datasets.load_digits()

In [None]:
# update mystery
mystery = mystery + digits.target[2]

In [None]:
from scipy import misc

face = misc.face()
plt.imshow(face)
plt.show() # you should see a raccoon

In [None]:
# update mystery
mystery = mystery + face[3][880][0]
mystery

In [None]:
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Load data
dat = sm.datasets.get_rdataset("Guerry", "HistData").data

# update mystery
mystery = mystery + int(dat.Literacy.iloc[1]/50)
print(mystery)

dat.head()

In [None]:
from __future__ import absolute_import, division, print_function, unicode_literals

# TensorFlow and tf.keras
import tensorflow as tf
from tensorflow.keras import layers
from tensorflow.keras import models
from tensorflow.keras.layers import Dense
from tensorflow.keras.models import Sequential
from tensorflow.keras.regularizers import l2

tf.keras.backend.clear_session()  # For easy reset of notebook state.

print(tf.__version__)  # You should see a >2.0.0 here!
print(tf.keras.__version__)

mystery = mystery +  int(str(tf.keras.__version__).split(".")[0])
mystery

<div class='exercise'> <b> Exercise 3:  Let's write some Python code! </b></div>

**3.1**: Write a function to calculate the first `num` Fibonacci Numbers (`num` will be an integer), and put them in a list. Then **run** your function to create the list for `num=50`. Use `assert` to check if the `num` passed in the argument of the function is an integer and throw an error if it's not. 

**Note:** we encourage you to write this from scratch. Create a Python list with the first 2 Fibonacci Numbers by hardcoding them: [0, 1], and then append the rest of the elements of the list by using the previous two.
    
You **may not** use `numpy`; use Python lists.


In [None]:
def fib(num):
    """(This is called a docstring, contains info about the function)
       This function calculates the first `num` numbers in the 
       Fibonacci sequence.
       ----
       Arguments: the number of numbers to create
       ----
       Returns: a list containing those numbers
    """
    assert type(num) is int, "num must be an integer!"

    fib_list = [0, 1]
    # Keep appending until the list holds `num` values;
    # each new value is the sum of the previous two.
    while len(fib_list) < num:
        fib_list.append(fib_list[-1] + fib_list[-2])

    # Slice so that num = 0 or num = 1 also returns exactly `num` values.
    return fib_list[:num]

**3.2**: Generate the first 50 Fibonacci numbers and name it `fibonaccis`.

In [None]:
fibonaccis = fib(50)


**3.3**: Using [list slicing](https://docs.python.org/3/tutorial/introduction.html#lists), reverse the list.

In [None]:
fibonaccis = fibonaccis[::-1]
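
The `[::-1]` idiom is one case of the general slice form `sequence[start:stop:step]`. The short sketch below illustrates a few common variants on a made-up list (`letters` is just example data, not part of the homework).

```python
letters = ["a", "b", "c", "d", "e"]

first_two = letters[:2]        # items at index 0 and 1
last_two = letters[-2:]        # the final two items
every_other = letters[::2]     # step of 2, starting at index 0
reversed_copy = letters[::-1]  # step of -1 walks the list backwards

print(first_two, last_two, every_other, reversed_copy)
# Slicing builds a new list, so the original is unchanged.
print(letters)
```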

**3.4**: Python lists: write code to add the $45^{th}$ element to the mystery number

In [None]:
mystery = mystery + fibonaccis[45]

**3.5**: Check to see if the number 34 is within your `fibonaccis` list. If 34 is within your `fibonaccis` list, add 1 to the `mystery` number (1 represents True). If 34 isn't within the list, do not add anything to the mystery number.

In [None]:
if 34 in fibonaccis:
    mystery += 1

**3.6:** Fill in the `.sample()` input parameters below to generate 10,000 random numbers in [1, 1,000,000).

In [None]:
random.seed(a=13, version=2)
random_numbers = random.sample(range(1,1000000), 10000) # your code within the ( ) 

**3.7:** Add to the `mystery` number the first digit of the first item in `random_numbers`. For example, if the first item in `random_numbers` is 982, then add 9 to the mystery number.

In [None]:
print(str(random_numbers[0]))
mystery = mystery + int(str(random_numbers[0])[0])
mystery

Writing tons of code in a linear fashion can only get you so far. Functions are incredibly useful, as they help us modularize, organize, and reuse pieces of our code. It's important to understand how functions work and the scope of each variable. For example, in the function below, `func()`, we demonstrate that the function has its own local scope. That is, `func()` has its own variable called `num`, which is independent from the `num` that exists outside of the function.

In [None]:
def func(num: int):
    num = num*2
    
num = 3
func(num)
print(num)

However, functions can also return values, which allow code outside of functions to use said values.

In [None]:
def func2(num: int):
    num = num*2
    return num

num = 3
num = func2(num)
print(num)
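
The two cells above pass an integer. A related case worth knowing is what happens with a mutable argument such as a list: reassigning the parameter name stays local, but modifying the object in place is visible to the caller. The function names `rebind` and `mutate` below are illustrative only.

```python
def rebind(values):
    # Assigning to the parameter name only changes the local variable.
    values = values + [99]


def mutate(values):
    # Calling a method on the object changes the list the caller also sees.
    values.append(99)


data = [1, 2, 3]
rebind(data)
after_rebind = list(data)   # snapshot after rebind()
mutate(data)
after_mutate = list(data)   # snapshot after mutate()
print(after_rebind, after_mutate)
```

Here `rebind` leaves `data` untouched, while `mutate` appends to the same list object that `data` refers to.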

**3.8**:  Write a function `pair_exists()` that takes two items as input:
- `nums`: a list (or other data structure)
- `target`: an integer

The function should return **True** if any two numbers within `nums` sum up to `target`. Otherwise, return **False**. Note, you are not restricted to using a list (e.g., could use a Set or Hash/Dictionary structure). Think about the tradeoffs of transforming `random_numbers` to other data structures.

In [None]:
def pair_exists(nums, target, verbose=False):
    # Compare every pair of positions (first_num, second_num) with first_num < second_num.
    for first_num in range(len(nums) - 1):
        for second_num in range(first_num + 1, len(nums)):
            if nums[first_num] + nums[second_num] == target:
                if verbose:
                    print('1st index: ' + str(first_num))
                    print('1st number: ' + str(nums[first_num]))
                    print('2nd number: ' + str(nums[second_num]))
                    print('2nd index: ' + str(second_num))
                    print('sum: ' + str(nums[first_num] + nums[second_num]))
                return True

    return False
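
The nested-loop solution checks every pair, so its running time grows quadratically with the length of `nums`. As a sketch of the tradeoff the exercise hints at, the version below keeps a set of values seen so far and looks up each complement in (on average) constant time, at the cost of extra memory. `pair_exists_fast` is a hypothetical name, not part of the assignment.

```python
def pair_exists_fast(nums, target):
    """Return True if two items at different positions in `nums` sum to `target`."""
    seen = set()
    for value in nums:
        # If the complement was seen earlier, the two items form a valid pair.
        if target - value in seen:
            return True
        seen.add(value)
    return False


print(pair_exists_fast([2, 7, 11, 15], 9))   # 2 + 7
print(pair_exists_fast([2, 7, 11, 15], 10))  # no pair sums to 10
print(pair_exists_fast([5], 10))             # a single 5 cannot pair with itself
```

Because a value is added to `seen` only after its own complement check, an item is never paired with itself, which matches the behavior of the nested loops.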



In [None]:
print(pair_exists(random_numbers, 38109)) # SHOULD RETURN TRUE
print(pair_exists(random_numbers, 13538)) # SHOULD RETURN FALSE

In [None]:
print(pair_exists(random_numbers, 13538, verbose = True))

Generate 10 random numbers in [0, 100,000) using the code below: 


In [None]:
random.seed(a=12, version=2)
target_numbers = random.sample(range(100000), k=10)
num_found = 0

**3.9**: Write code that executes your `pair_exists()` function, once for each number within the `target_numbers` list. How many of these values return **True**? Add this to the mystery number (e.g., if 5 of the numbers cause `pair_exists()` to return True, then add 5 to the mystery number).

In [None]:
for number in target_numbers:
    num_found += pair_exists(random_numbers, number)  # True counts as 1
# add the count to the mystery number, as 3.9 asks
mystery += num_found

Dictionaries in Python (called hash maps in many other languages) are among the most powerful and commonly used data structures: they can be nested to represent non-flat data, and looking up a value by its key is very fast (constant time on average). They consist of key-value pairs. You should feel comfortable using a dictionary in Python. 

As an example, generations within the western world are typically defined as follows:

- Lost Generation: 1883-1900
- Greatest Generation: 1901-1927
- Silent Generation: 1928 - 1945
- Baby Boomers: 1946-1964
- Generation X: 1965-1980
- Millennials: 1981-1996
- Generation Z: 1997-2012


**3.10** Create a dictionary named `generations` where the keys are the generation names (e.g., 'Silent Generation') and the values are *starting years* for each generation. For example, one key-value pair would be: `{'Silent Generation' : 1928}`

In [None]:
generations = {'Lost Generation': 1883,
'Greatest Generation': 1901,
'Silent Generation': 1928,
'Baby Boomers': 1946,
'Generation X': 1965,
'Millennials': 1981,
'Generation Z': 1997}

In [None]:
mystery = mystery + int(generations['Baby Boomers'])/973
mystery

Below, `possible_gens` contains several items. 

In [None]:
possible_gens = [ 'The Lost Generation', 'Zoomers', 'Generation X', 'Generation Y',
                 'Fitbiters',  'Clickbaiters','Zipliners', 'Lipliners',
                 'Ghostbusters', 'MythBusters', 'Smack Talkers', 'Snack Eaters',
                 'Aarvarks', 'Baby Boomers', 'Silent Wavers', 'Earth Shakers',
                 'Ground Breakers', 'The Greatest Generation', 'Silent Generation',
                 'Salsa Dancers', 'Horse Riders', 'Millennials', 'Castle Dwellers',
                 'Chain Smokers', 'Rain Makers', 'Generation Jay Z', 'Sun Bathers']


**3.11**:  Write some code that adds 1 to `mystery` for each `possible_gens` item that is also a key within the `generations` dictionary. For example, if 2 of those items (e.g., Zoomers and Babies) are both keys in `generations`, then add 2 to the mystery number.

In [None]:
for name in possible_gens:
    mystery += name in generations.keys()
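
Two equivalent ways of counting matches are sketched below on small stand-in data (`start_years` and `candidates` are illustrative names and values, not the homework variables). Both rely on the fact that a membership test on a dictionary checks its keys.

```python
# Small stand-in data; the names below are illustrative only.
start_years = {"Baby Boomers": 1946, "Generation X": 1965, "Millennials": 1981}
candidates = ["Zoomers", "Generation X", "Millennials", "Sun Bathers"]

# Membership tests on a dict check its keys, so `.keys()` is optional.
matches = sum(name in start_years for name in candidates)

# Equivalent when the candidate list has no duplicates: intersect the key sets.
common = start_years.keys() & set(candidates)

print(matches, sorted(common))
```

The set intersection also tells you which names matched, not only how many.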

<div class='exercise'> <b> Exercise 4: Matrix Operations</b></div>
    
Complete the following matrix operations by hand (show your work as a markdown/latex notebook cell).

**4.1.** Let $ A =  \left( \begin{array}{ccc}
3 & 4 & 2 \\
5 & 6 & 4 \\
4 & 3 & 4 \end{array} \right) \,\,$ and  $ \,\, B = \left( \begin{array}{ccc}
1 & 4 & 2 \\
1 & 9 & 3 \\
2 & 3 & 3 \end{array} \right)
$

Compute $C=A \cdot B$ and add the `C[0,2]` value to `mystery`. 

C[0,2] = 3 * 2 + 4 * 3 + 2 * 3

C[0,2] = 6+12+6

C[0,2] = 24 

The remaining entries follow the same pattern.

The whole matrix is

$ \,\, C = \left( \begin{array}{ccc}
3+4+4 & 12+36+6 & 6+12+6\\
5+6+8 & 20+54+12 & 10+18+12\\
4+3+8 & 16+27+12 & 8+9+12\end{array} \right)
$ 

$ \,\, C = \left( \begin{array}{ccc}
11 & 54 & 24\\
19 & 86 & 40\\
15 & 55 & 29\end{array} \right)
$ 

In [None]:
A = np.array([[3,4,2], [5,6,4], [4,3,4]])
B = np.array([[1,4,2], [1,9,3], [2,3,3]])
C = A.dot(B)
mystery = mystery + C[0,2]
mystery 
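
As a cross-check of the hand computation that does not depend on `numpy`, each entry of the product can be computed directly from the definition $C_{ij} = \sum_k A_{ik} B_{kj}$. The helper `matmul` below is a minimal sketch for small matrices stored as lists of rows.

```python
A = [[3, 4, 2], [5, 6, 4], [4, 3, 4]]
B = [[1, 4, 2], [1, 9, 3], [2, 3, 3]]


def matmul(left, right):
    """Multiply two matrices given as lists of rows: C[i][j] = sum_k left[i][k] * right[k][j]."""
    inner = len(right)
    return [[sum(left[i][k] * right[k][j] for k in range(inner))
             for j in range(len(right[0]))]
            for i in range(len(left))]


C = matmul(A, B)
for row in C:
    print(row)
```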

**4.2.** Let
$$ A =  \left( \begin{array}{ccc}
0 & 12 & 8 \\
1 & 15 & 0 \\
0 & 6 & 3 \end{array} \right)$$  

Compute $C = A^{-1}$ and add the value of `C[1,1]` to `mystery`. 

Using Gauss-Jordan row reduction on the augmented matrix $[A \mid I]$:
$$ \left( \begin{array}{ccc|ccc}
0 & 12 & 8 & 1 & 0 & 0 \\
1 & 15 & 0 & 0 & 1 & 0 \\
0 & 6 & 3 & 0 & 0 & 1\end{array} \right)$$  

First reduce the top row ($R_1 \leftarrow (R_1 - 2R_3)/2$):
$$ \left( \begin{array}{ccc|ccc}
0 & 0 & 1 & 0.5 & 0 & -1 \\
1 & 15 & 0 & 0 & 1 & 0 \\
0 & 6 & 3 & 0 & 0 & 1\end{array} \right)$$  

Next the bottom row ($R_3 \leftarrow (R_3 - 3R_1)/6$):
$$ \left( \begin{array}{ccc|ccc}
0 & 0 & 1 & 0.5 & 0 & -1 \\
1 & 15 & 0 & 0 & 1 & 0 \\
0 & 1 & 0 & -0.25 & 0 & \frac{2}{3}\end{array} \right)$$  

Then the middle row ($R_2 \leftarrow R_2 - 15R_3$):
$$ \left( \begin{array}{ccc|ccc}
0 & 0 & 1 & 0.5 & 0 & -1 \\
1 & 0 & 0 & 3.75 & 1 & -10 \\
0 & 1 & 0 & -0.25 & 0 & \frac{2}{3}\end{array} \right)$$  

Then put the rows in the right order; the right-hand block is $C = A^{-1}$, so `C[1,1]` $= 0$:
$$ \left( \begin{array}{ccc|ccc}
1 & 0 & 0 & 3.75 & 1 & -10 \\
0 & 1 & 0 & -0.25 & 0 & \frac{2}{3} \\
0 & 0 & 1 & 0.5 & 0 & -1 \end{array} \right)$$  

In [None]:
A = np.array([[0,12,8],[1,15,0],[0,6,3]])
C = np.linalg.inv(A)
mystery = mystery + C[1,1]
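
To check the hand-computed inverse exactly (without floating-point rounding), the same Gauss-Jordan procedure can be run on $[A \mid I]$ with exact fractions. This is a sketch for small matrices, not a replacement for `np.linalg.inv`; `invert` is a hypothetical helper, and its pivoting order may differ from the row operations written above even though the result is the same.

```python
from fractions import Fraction


def invert(matrix):
    """Invert a square matrix with Gauss-Jordan elimination on [A | I], using exact fractions."""
    n = len(matrix)
    # Build the augmented matrix [A | I].
    aug = [[Fraction(x) for x in row] + [Fraction(int(i == j)) for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        # Find a row at or below the diagonal with a nonzero entry in this column.
        pivot = next((r for r in range(col, n) if aug[r][col] != 0), None)
        if pivot is None:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        # Scale the pivot row so the pivot becomes 1.
        scale = aug[col][col]
        aug[col] = [x / scale for x in aug[col]]
        # Eliminate this column from every other row.
        for r in range(n):
            if r != col and aug[r][col] != 0:
                factor = aug[r][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    # The right half of the augmented matrix is now the inverse.
    return [row[n:] for row in aug]


A = [[0, 12, 8], [1, 15, 0], [0, 6, 3]]
A_inv = invert(A)
for row in A_inv:
    print([str(x) for x in row])
```

The entry at row 1, column 1 is exactly zero, and the entry at row 1, column 2 is exactly 2/3.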

<div class='exercise'> <b> Exercise 5: Basic Statistics </b></div>

**37** of the **76** freshman CS concentrators at University X have taken CS109a, while **50** of the **133** sophomore concentrators have taken CS109a.  

Use the $z$-test for 2 proportions to determine if interest in Data Science (measured as wanting to take CS109a) differs between the two groups.  

$$z = \frac{\hat{p}_1-\hat{p}_2}{\sqrt{\hat{p}_{pooled}(1-\hat{p}_{pooled})\left(\frac{1}{n_1}+\frac{1}{n_2}\right)}}$$

where $n_1$ and $n_2$ are the numbers of freshman and sophomore CS concentrators, $\hat{p}_1$ is the proportion of freshman concentrators who have taken CS109a, $\hat{p}_2$ is the same proportion for sophomores, and $\hat{p}_{pooled}$ is the proportion of all concentrators (both years combined) who have taken CS109a.


    
Store the result in the variable `z`.

In [None]:
n1 = 76
n2 = 133
p1 = 37/n1
p2 = 50/n2
p_p = (37+50)/(n1+n2) 
z = (p1-p2)/np.sqrt(p_p*(1-p_p)*(1/n1 + 1/n2))
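
The exercise only asks for $z$, but to judge whether the two groups differ one would normally turn it into a p-value. The sketch below recomputes $z$ with the standard library and derives a two-sided p-value from the standard normal distribution via `math.erfc`; choosing a two-sided test is an assumption, since the exercise does not specify one.

```python
import math

n1, n2 = 76, 133           # concentrators in each group
x1, x2 = 37, 50            # how many in each group took CS109a

p1, p2 = x1 / n1, x2 / n2
p_pooled = (x1 + x2) / (n1 + n2)
standard_error = math.sqrt(p_pooled * (1 - p_pooled) * (1 / n1 + 1 / n2))
z = (p1 - p2) / standard_error

# Two-sided p-value under the standard normal: P(|Z| >= |z|) = erfc(|z| / sqrt(2)).
p_value = math.erfc(abs(z) / math.sqrt(2))

print(f"z = {z:.3f}, two-sided p-value = {p_value:.3f}")
```

With these counts the p-value is above the conventional 0.05 threshold, so the difference between the two years would not be called significant at that level.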

In [None]:
# update mystery
mystery -= 16*int(z)

### And the Mystery Number is:

In [None]:
print(f'Mystery number is: {mystery} !!')