# W200 Python Fundamentals for Data Science, UC Berkeley MIDS
# Final Exam


## Instructions
The final exam is designed to evaluate your grasp of Python theory as well as Python coding.

- This is an individual exam.
- You have 24 hours to complete the exam, starting from the point at which you first access it.
- You will be graded on the quality of your answers.  Use clear, persuasive arguments based on concepts we covered in class.
- While we've left one code/markdown cell for you after each question as a placeholder, some of your answers will require multiple cells to fully respond.
- Double click the markdown cells where it says YOUR ANSWER HERE to enter your written answers; if you need more cells for your written answers, please make them markdown cells (rather than code cells)

## YOUR NAME HERE
Grace Yuqing Lin

## 1: General Questions (21 pts)

a) The following method is part of a larger program used by a mobile phone company.  It will work when an object of type MobileDevice or of type ServiceContract is passed in.  This is a demonstration of (select all that apply and state a reason why it applies):

    1. Inheritance
    2. Polymorphism
    3. Duck typing
    4. Top-down design
    5. Functional programming

In [5]:
# Method:

def add_to_cart(item):
    cart.append(item)
    total += item.price

- a) Your answer here

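This code demonstrates duck typing, which is one way Python achieves polymorphism. `add_to_cart` works on both a MobileDevice and a ServiceContract because it never checks the type of `item`; it only relies on `item` having a `price` attribute. An object's suitability is determined by the presence of the required methods and properties, rather than by the type of the object itself. Inheritance is not demonstrated, because nothing in the code requires the two classes to share a parent.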
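A minimal sketch of the idea, assuming two unrelated classes that happen to share a `price` attribute. Only the names `MobileDevice`, `ServiceContract`, and `add_to_cart` come from the question; the class bodies, the prices, and the `Cart` wrapper are hypothetical.

```python
class MobileDevice:
    """Hypothetical product class; only `price` matters to the cart."""
    def __init__(self, price):
        self.price = price


class ServiceContract:
    """Unrelated class: it shares no base class with MobileDevice."""
    def __init__(self, price):
        self.price = price


class Cart:
    """Hypothetical wrapper holding the cart state used by add_to_cart."""
    def __init__(self):
        self.items = []
        self.total = 0

    def add_to_cart(self, item):
        # No isinstance check: any object with a `price` attribute works.
        self.items.append(item)
        self.total += item.price


cart = Cart()
cart.add_to_cart(MobileDevice(price=500))
cart.add_to_cart(ServiceContract(price=40))
print(cart.total)  # 540
```

Any other object with a `price` attribute could be added the same way, without changing `add_to_cart`.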


b) Suppose you have a long list of digits (0-9) that you want to write to a file.  Would it be more efficient to use ASCII or UTF-8 as an encoding?  How could you create an even smaller binary file to store the information?

- b) Your answer here

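It would be equally efficient to use ASCII or UTF-8, because UTF-8 encodes the digits 0-9 with exactly the same single bytes as ASCII. To create an even smaller file, we could store the digits in binary rather than as text: each digit needs only 4 bits, so two digits can be packed into one byte, which roughly halves the file size. A general-purpose compressor could shrink the file further.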
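As a rough illustration of the size difference, the sketch below compares the text encodings with a two-digits-per-byte packing. The helper names `pack_digits` and `unpack_digits` and the use of `0xF` as a padding marker are assumptions made for this example.

```python
def pack_digits(digits):
    """Pack decimal digits (0-9) two per byte, 4 bits each."""
    if len(digits) % 2:
        # Pad odd-length input; 0xF is not a valid digit, so it marks padding.
        digits = digits + [0xF]
    return bytes((hi << 4) | lo for hi, lo in zip(digits[0::2], digits[1::2]))


def unpack_digits(data):
    """Reverse pack_digits, dropping the padding nibble if present."""
    out = []
    for byte in data:
        out.extend((byte >> 4, byte & 0x0F))
    if out and out[-1] == 0xF:
        out.pop()
    return out


digits = [3, 1, 4, 1, 5, 9, 2]
as_ascii = "".join(map(str, digits)).encode("ascii")
as_utf8 = "".join(map(str, digits)).encode("utf-8")
packed = pack_digits(digits)

print(as_ascii == as_utf8)         # True: identical bytes for digits
print(len(as_ascii), len(packed))  # 7 4
```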


c) You are part of a team working on a spreadsheet program that is written in Python 3.  The program includes several classes to represent different types of objects that fit into a cell of a spreadsheet.  Give a strong argument for why your team should write an abstract base class to represent such objects and give examples of what should go into such an abstract base class.

- c) Your answer here

We need an abstract base class because it offers a middle ground between Python's free-form style and a statically typed language. In a team setting, an abstract base class keeps everyone on the same page when implementing the subclasses, because it defines a shared API for all of them. In addition, if a subclass does not implement every required method (or property), we get a TypeError as soon as we try to instantiate it, rather than an AttributeError that might surface much later. Examples of what could go into the base class are abstract methods such as insert() and delete(), along with common properties such as color and font.
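A minimal sketch of such a base class using the standard `abc` module. The class names `Cell`, `NumberCell`, and `BrokenCell` and the abstract method `display` are hypothetical, chosen only to show the instantiation-time error.

```python
from abc import ABC, abstractmethod


class Cell(ABC):
    """Hypothetical base class for anything that can live in a spreadsheet cell."""

    def __init__(self, color="black", font="Arial"):
        # Shared formatting attributes that every cell type inherits.
        self.color = color
        self.font = font

    @abstractmethod
    def display(self):
        """Return the string shown in the cell."""


class NumberCell(Cell):
    def __init__(self, value, **fmt):
        super().__init__(**fmt)
        self.value = value

    def display(self):
        return f"{self.value:.2f}"


class BrokenCell(Cell):
    pass  # a teammate forgot to implement display()


print(NumberCell(3.14159).display())  # 3.14

try:
    BrokenCell()
except TypeError as err:
    # The mistake is reported at instantiation, not at first use.
    print("Caught at instantiation:", err)
```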


d) Explain why NumPy is better than lists for "vectorized" math operations. Give an example of an operation that is either impossible or painful to implement using traditional Python lists compared to NumPy arrays.

- d) Your answer here

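Python lists hold references to arbitrary objects, so element-wise math on them cannot be delegated to an efficient C loop: each iteration requires type checks and other Python API bookkeeping. NumPy arrays store elements of a single type, so an operation on a whole array runs in compiled code over all of the elements. Element-wise multiplication of two sequences is a good example: with lists, `a * b` is not even defined and we need an explicit loop or comprehension, while with NumPy it is simply `a * b`. The timings below show that NumPy is also much faster than the list version.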

In [51]:
# if we multiply two sequences with a list comprehension: 
import random
a = [random.randint(1, 100) for i in range(100000)]
b = [random.randint(1, 100) for j in range(100000)]
%timeit result = [x * y for x, y in zip(a, b)]


20.3 ms ± 3.53 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [52]:
# it's much faster and easier to understand if we use numpy:
import numpy as np
a = np.random.randint(1, 100, 100000)
b = np.random.randint(1, 100, 100000)
%timeit result = a * b



159 µs ± 15.5 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)


e) We want a list of the squares of the nonnegative integers less than 10, keeping only the squares that are greater than 10.  The list comprehension below gives an empty list.  Correct it so that we get the desired output: [16, 25, 36, 49, 64, 81].

In [7]:
[x**2 for x in range(10) if x**2 > 10]

[16, 25, 36, 49, 64, 81]

f) Explain why the following code prints what it does.

In [7]:
def f(): pass
print(type(f))

<class 'function'>


- f) Your answer here

The `def` statement binds the name f to a function object whose body does nothing. print(type(f)) inspects that object without calling it, so it shows the type of f itself, which is function.

g) Explain why the following code prints something different.

In [8]:
def f(): pass
print(type(f()))

<class 'NoneType'>


- g) Your answer here

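Here the parentheses in f() call the function, so print(type(f())) shows the type of the value f returns. Because the body is only `pass` and there is no return statement, f returns None, whose type is NoneType.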
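The difference between parts f) and g) can be checked directly with a short sketch:

```python
def f():
    pass


# Without parentheses we inspect the function object itself.
print(type(f).__name__)    # function

# With parentheses we call it; a function with no return statement returns None.
print(f() is None)         # True
print(type(f()).__name__)  # NoneType
```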

## 2: Data Integrity (25 pts)

a) Why is it important to sanity-check your data before you begin your analysis? What could happen if you don't?

- a) Your answer here

If the data contains unreasonable values, they can distort our results. For example, if we are calculating an average age and some records have negative ages, the average will be biased downward, and any conclusion built on it will be wrong.
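A small sketch of the age example, using made-up values in which -1 stands in for a missing age. The valid range of 0 to 120 is an assumption for illustration.

```python
from statistics import mean

# Hypothetical ages; -1 was used as a "missing" code by whoever entered the data.
ages = [34, 29, -1, 41, 38, -1]

naive_mean = mean(ages)

# Keep only values inside an assumed plausible range before averaging.
valid_ages = [a for a in ages if 0 <= a <= 120]
clean_mean = mean(valid_ages)

print(round(naive_mean, 2))  # 23.33
print(clean_mean)            # 35.5
```

Two bad records out of six are enough to pull the average down by about twelve years.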

b) Explain, in your own words, why real-world data is often messy.

- b) Your answer here

Real-world data is often messy because the same kind of information can be recorded in very different types and formats. For example, dates and times can be written in many different formats, so we need to normalize such fields before we can run an analysis.
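One possible way to normalize dates is to try a list of known formats in turn. In the sketch below, the format list, the sample strings, and the helper name `parse_date` are all assumptions; note that a string like 07/05/2016 is ambiguous between month-first and day-first, so the order of the formats encodes a choice.

```python
from datetime import datetime

# Assumed set of formats seen in the raw data, tried in this order.
KNOWN_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%d %b %Y")


def parse_date(text):
    """Return a date for the first matching format, or None if none match."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue
    return None


raw = ["2016-07-05", "07/05/2016", " 5 Jul 2016", "sometime in July"]
parsed = [parse_date(r) for r in raw]

for original, value in zip(raw, parsed):
    print(repr(original), "->", value)
```

Values that come back as `None` are the ones that need manual inspection.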

c) How do you determine which variables in your dataset you should check for issues prior to starting an analysis? 

- c) Your answer here

We can use the describe() function to get a summary view of the data; if we see any extreme values, we should check those variables further. We can also apply common sense about what the data should look like. For example, if a variable records colors and calling set() on it to list the unique values turns up "umbrella", that is probably an error.
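The set-based check can be sketched as follows, assuming a hypothetical list of recorded colors and a hypothetical set of allowed values:

```python
# Hypothetical raw values for a "color" variable.
colors = ["red", "blue", "umbrella", "white", "blue", "Red "]

# Assumed list of legitimate values, based on what we know about the data.
allowed = {"red", "blue", "white"}

# Anything in the data that is not an allowed value deserves a closer look.
unexpected = sorted(set(colors) - allowed)
print(unexpected)  # ['Red ', 'umbrella']
```

Here the check surfaces both a nonsense value and a formatting inconsistency (capitalization and a trailing space).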

d) How do you know when you have adequately checked these variables?

- d) Your answer here

We can plot the variables to see whether their distributions look reasonable, and we can write specific test code (for example, assertions on value ranges) to check them automatically.

e) Is it possible to fully vet your data for errors before you begin your analysis? If not, what should you be looking out for while you complete your analysis?

- e) Your answer here

It's not possible to fully vet the data before beginning the analysis under most circumstances. Type errors or other miscellaneous errors can still show up during the analysis. While we complete the analysis, we should keep checking the data for consistency and look out for extreme values, missing values, type errors, and so on.

## 3:  Elections (24 pts)

Consider the following data frame in Pandas.

In [25]:
import pandas

# creating a data frame from scratch - list of lists

data = [ ['marco', 165, 'blue', 'FL'], 
         ['jeb', 0, 'red', 'FL'], 
         ['chris', 0, 'white', 'NJ'], 
         ['donald', 1543, 'white', 'NY'],
         ['ted', 559, 'blue', 'TX'],
         ['john', 161, 'red', 'OH']
       ]

# create a data frame with column names - list of lists

col_names = ['name', 'delegates', 'color', 'state']
df = pandas.DataFrame(data, columns=col_names)
df

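|   | name | delegates | color | state |
|---|------|-----------|-------|-------|
| 0 | marco | 165 | blue | FL |
| 1 | jeb | 0 | red | FL |
| 2 | chris | 0 | white | NJ |
| 3 | donald | 1543 | white | NY |
| 4 | ted | 559 | blue | TX |
| 5 | john | 161 | red | OH |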


a) Using bracket indexing in Pandas, show how many delegates `ted` got.

In [59]:
# a) Your answer here
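# Select ted's row with a boolean mask, then the delegates column;
# this leaves df.index untouched for the later questions.
df[df['name'] == 'ted']['delegates'].iloc[0]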

559

b) Using bracket indexing in Pandas, show how many total delegates were obtained by candidates whose favorite color is blue.

In [61]:
# b) Your answer here
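# Keep only the rows whose color is blue, then sum their delegates;
# df.index is not modified, so the groupby in part c) is unaffected.
df[df['color'] == 'blue']['delegates'].sum()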

724

c) Using groupby and aggregate in Pandas, show how many total delegates were obtained by candidates grouped by favorite color.

In [39]:
# c) Your answer here
print(df.groupby('color').agg({'delegates': 'sum'}))


       delegates
color           
blue         724
red          161
white       1543


## 4: Clinical disease data (30 pts)

Your boss comes to you Monday morning and says “I figured out our next step; we are going to pivot from an online craft store and become a data center for genetic disease information! I found **ClinVar** which is a repository that contains expert curated data, and it is free for the taking. This is a gold mine! Take a week and tell me what gene and mutation combinations are classified as dangerous.”

1)  Look at the sample data set (in the Sample ClinVar data below or in the .txt file) and develop a plan of action to use python to extract and summarize just what your boss wants. **Don’t code**. You can use pseudocode and/or an essay format to generate a plan in 500 words or less. 

2) Tell us the output that you expect from your planned code

**Hints:**  

* Look at the sample file carefully. What fields do you want to extract? Are they in the same place every time? What strategy will you use to robustly extract and filter your data of interest? How do you plan to handle missing data?

* Filter out junk. Just focus on what your boss asked for (1) gene name (2) mutation reference. (3) Filter your data to include only mutations that are dangerous as you define it. 

* Pandas and NumPy parsers correctly recognize the end of each line in the ClinVar file.

* The unit of observation of this dataset is one row per mutation.

* While you shouldn't code your analysis, creating a few lines of code while you think through the problem may be helpful (so that you can sanity check that your plan works). So you can experiment, we have included the data file below as a Tab Separated Value file "Genomics_Questions.txt". Please do not submit any such code. For example, if I wanted to check that I accurately understand the "split" function in the context of this data, I could type:

```python
sample = "abc;def;asd"
test = sample.split(';')
```

**This is a planning question; we want you to lay out a plan in text, not code.** 

### VCF file description (Summarized from version 4.1)


* The VCF specification:

VCF is a text file format which contains meta-information lines, a header
line, and then data lines each containing information about a position in the genome. The format also has the ability to contain genotype information on samples for each position.

* Fixed fields:

There are 8 fixed fields per record. All data lines are **tab-delimited**. In all cases, missing values are specified with a dot (‘.’). 

1. CHROM - chromosome number
2. POS - position DNA nucleotide count (bases) along the chromosome
3. ID - The unique identifier for each mutation
4. REF - reference base(s) alleles
5. ALT - alternate base(s) alleles
6. QUAL - Phred scaled quality score
7. FILTER - filter status (if position has passed all filters)
8. INFO - a semicolon-separated series of  keys with values in the format: <key>=<data>, and specified as <key>=<data name>[data value definition].


### INFO field specifications

```
GENEINFO = <Gene symbol>
CLNSIG =  <Variant Clinical Significance (Severity)>
  0 – unknown	(Uncertain significance)
  1 – untested	(not provided)
  2 - non-pathogenic	(Benign)
  3 - probable-non-pathogenic	(Likely benign)
  4 - probable-pathogenic	(Likely pathogenic)
  5 – pathogenic	(Pathogenic)
  6 - drug-response	(drug response)
  7 – histocompatibility	(histocompatibility)
  255 - other	(other)
```

### Representative/Sample ClinVar data (vcf file format)

```
##fileformat=VCFv4.0							
##fileDate=20160705							
##source=ClinVar and dbSNP							
##dbSNP_BUILD_ID=147							
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO
1	949523	rs786201005	C	T	.	.	GENEINFO=ISG15;CLNSIG=5
1	949696	rs672601345	C	CG	.	.	GENEINFO=ISG15;CLNSIG=5;CLNDBN=Cancer
1	949739	rs672601312	G	T	.	.	GENEINFO=ISG15;CLNDBN=Cancer
1	955597	rs115173026	G	T	.	.	GENEINFO=AGRN;CLNSIG=2; CLNDBN=Cancer
1	955619	rs201073369	G	C	.	.	GENEINFO=AGG;CLNDBN=Heart_dis 
1	957640	rs6657048	C	T	.	.	GENEINFO=AGG;CLNSIG=3;CLNDBN=Heart_dis 
1	976059	rs544749044	C	T	.	.	GENEINFO=AGG;CLNSIG=0;CLNDBN=Heart_dis 
```

A second version of this file is provided as a .txt file in case you want to load it into your console to test it out. You can use either file for the data modeling.

##### Your answer here (use as many cells as you need!)

Since we are only interested in genes and mutations, I would extract the ID field (the mutation identifier) and the INFO field (which holds GENEINFO and CLNSIG). I will then filter the rows on data quality to obtain the working data set. 

I will convert and normalize the fields. For example, I will convert CLNSIG to a number and summarize it with describe() to see whether there are any extreme or invalid values, such as a negative CLNSIG. 

I will separate the INFO field into different columns by splitting on the semicolon. Because some observations have CLNDBN or CLNSIG and some do not, a given key is not always in the same position, so I will split each piece on "=" and map the values to columns by key name rather than by position. 

In terms of missing data (a dot, or a key that is absent from INFO), if the missing value is in a field we need (gene name, mutation ID, or CLNSIG), we exclude the observation. If it is in a field we have already dropped, we can still use the observation.

I will extract the observations with CLNSIG between 4 and 7; I define those observations as "dangerous". 
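In the spirit of the `split` hint, and only as a sanity check that the parsing and filtering steps above hold together, they can be sketched on a few of the sample rows. The helper names `parse_info` and `dangerous_rows` are hypothetical, and the set of "dangerous" codes simply mirrors the definition given in this answer.

```python
# Three data rows copied from the sample, plus one meta line and the header.
SAMPLE = """\
##fileformat=VCFv4.0
#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO
1\t949523\trs786201005\tC\tT\t.\t.\tGENEINFO=ISG15;CLNSIG=5
1\t949739\trs672601312\tG\tT\t.\t.\tGENEINFO=ISG15;CLNDBN=Cancer
1\t955597\trs115173026\tG\tT\t.\t.\tGENEINFO=AGRN;CLNSIG=2; CLNDBN=Cancer
"""

DANGEROUS = {4, 5, 6, 7}  # this answer's definition of "dangerous"


def parse_info(info):
    """Turn 'K1=V1;K2=V2' into a dict, tolerating stray whitespace."""
    pairs = (item.strip().split("=", 1) for item in info.split(";") if "=" in item)
    return {key: value for key, value in pairs}


def dangerous_rows(text):
    """Return (gene, mutation ID, CLNSIG) for rows with a dangerous CLNSIG."""
    rows = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank, meta-information, and header lines
        fields = line.rstrip().split("\t")
        info = parse_info(fields[7])  # INFO is the 8th fixed field
        clnsig = info.get("CLNSIG")
        if clnsig is None or not clnsig.isdigit():
            continue  # missing or malformed CLNSIG: exclude the observation
        if int(clnsig) in DANGEROUS:
            rows.append((info.get("GENEINFO"), fields[2], int(clnsig)))
    return rows


print(dangerous_rows(SAMPLE))  # [('ISG15', 'rs786201005', 5)]
```

Looking values up by key name is what lets the row without CLNSIG and the row with a stray space before CLNDBN pass through without breaking the parse.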

Furthermore, I could split the dataset into 70% training data and 30% test data, build a model on the training data, and then test it on the test data, exploring the relationship among mutation ID, GENEINFO, and CLNSIG. Using CLNSIG as the indicator, I would use libraries such as NumPy and seaborn to plot correlations and a heat map, and try different models (logistic regression, random forest) to fit the training data. 

I expect the output to be a clean dataset with mutation ID, gene name, and CLNSIG. In addition, there could be an input box where people type in a gene name and mutation ID, and the program reports the likelihood of disease. 