# Lesson14 Individual Assignment

***Individual*** means that you do it yourself. You won't learn to code if you don't ***struggle for yourself*** and  ***write your own*** code. Remember that while you can discuss the general (algorithmic) way to solve a problem, you should ***not*** even be looking at anyone else's code or showing anyone else your code for an individual assignment.  
Review the **Group Work** guidelines on Canvas and/or ask an instructor if you have any questions.

## Programming Practice

Be sure to spell all function names ***correctly*** - misspelled functions will lose points (and often break anyway since no one is sure what to type to call it). If you prefer showing your earlier, scratch work as you figure out what you are doing, please be sure that you make a final, complete, correct last function in its own cell that you then call several times to test. In other words, separate your thought process/working versions from the final one (a comment that tells us which is the final version would be lovely).

Every function should have ***at least*** a docstring at the start that states what it does (see _Lesson3 Team Notebook_ if you need a reminder). Make other comments as necessary.  

Make sure that you are running ***test cases (plural)*** for everything and commenting on the results in markdown. Your comments should discuss how you know that the test case results are correct.

## part1: preparation

Don't forget that every new notebook needs to import the packages and libraries that you need (with the standard nicknames).

In [6]:
## don't forget to import and nickname


You should have exported a cleaned and tidied version of the `gapminder` DataFrame called `gapminder_CandT.csv` at the end of the `Team` activity.

***write code*** to import it into this notebook assigned to the identifier `gapminder_CandT` (use this name please).  (_hint_: use the `pd.read_csv()` function)
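As a reminder of how `pd.read_csv()` behaves, here is a minimal sketch on made-up data. The names `csv_text`, `toy`, `buf`, and `round_trip` are hypothetical, and an in-memory buffer stands in for a file on disk; in your notebook you would pass the file name as a string instead.

```python
import io

import pandas as pd

# Hypothetical stand-in for a CSV file; pd.read_csv() also accepts a path string.
csv_text = io.StringIO(
    "year,pop,country\n"
    "1952,100,aland\n"
    "1957,120,aland\n"
)
toy = pd.read_csv(csv_text)

# Quick checks that the import did what we expected
print(toy.head())   # first rows (up to 5)
print(toy.shape)    # (number of rows, number of columns)
print(toy.dtypes)   # column types inferred by pandas

# to_csv() writes the index as an unnamed first column by default.
# Reading it back with index_col=0 restores it as the index rather than
# leaving an extra data column.
buf = io.StringIO()
toy.to_csv(buf)
buf.seek(0)
round_trip = pd.read_csv(buf, index_col=0)
print(round_trip.columns.tolist())
```

Whether you need `index_col` depends on how the file was exported, so compare your result against the expected table.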

In [None]:
## your code to import the local file
## do not change the name!




***write code*** to make sure that it worked

In [None]:
# code to make sure it looks like the example below


The first 5 rows of your `gapminder_CandT` DataFrame should look like this:  

| |year|pop|lifeexp|gdppercap|country|continent
---|---|---|---|---|---|---
**0**|1952|8425333|28.801|779.445314|afghanistan|asia
**1**|1957|9240934|30.332|820.853030|afghanistan|asia
**2**|1962|10267083|31.997|853.100710|afghanistan|asia
**3**|1967|11537966|34.020|836.197138|afghanistan|asia
**4**|1972|13079460|36.088|739.981106|afghanistan|asia

**1\. Comment on the correct code to accomplish this import - what is new? what is not? what is important?**

## part2: merging data

Often we have more than one DataFrame that contains parts of our data set and we want to put them together. This is known as merging the data.

Now we want to add data for a new country called The People's Republic of Berkeley to the `gapminder` data set that we have cleaned up. Our goal is to get this new data into the same DataFrame in the same format as the gapminder data and, in this case, we want to concatenate (add) it onto the end of the gapminder data.

**Concatenating** is a simple form of merging; there are many useful (and more complicated) ways to merge data.  If you are interested in more information, the [Pandas Documentation](http://pandas.pydata.org/pandas-docs/stable/merging.html) on merging is useful.
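To make the distinction concrete, the sketch below contrasts stacking rows with `pd.concat()` against matching rows by key with `pd.merge()`. The DataFrames `a`, `b`, and `c` and their columns are invented for illustration only.

```python
import pandas as pd

# Toy frames (hypothetical data)
a = pd.DataFrame({"key": ["x", "y"], "val": [1, 2]})
b = pd.DataFrame({"key": ["z"], "val": [3]})            # same columns as a
c = pd.DataFrame({"key": ["x", "y"], "other": [10, 20]})  # shares only "key"

# Concatenating: rows of b are placed below the rows of a
stacked = pd.concat([a, b])
print(stacked)

# Merging: columns of c are attached to a wherever "key" matches
joined = pd.merge(a, c, on="key")
print(joined)
```

Concatenation only lines up cleanly when both frames have the same column names, which is why the cleaning step comes first.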

Download the People's Republic of Berkeley data (`PRB_data.csv`) from Canvas.

***write code*** to read in the People's Republic of Berkeley data (`PRB_data.csv`) and assign it to the identifier `PRB`.

In [1]:
## read in PRB data and assign it to PRB

## look at the top 5 rows


**Oh snap!** the `PRB` data is in the original format!   

You can't merge the `PRB` and `gapminder_CandT` DataFrames until they have the data you want in the same column structures, same labels, etc.

**You have 2 jobs in this section:**  
+ make sure there are no duplicates or NAs in the `PRB` data
+ Clean and tidy it so that it has the same format as the `gapminder_CandT` data (columns, column names, case of text, etc.)
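The kinds of checks listed above can be sketched on a small invented DataFrame. The name `raw` and its columns are hypothetical and do not reflect the actual `PRB` layout; the steps your data needs may differ.

```python
import numpy as np
import pandas as pd

# Toy data with one duplicated row and two missing values (hypothetical)
raw = pd.DataFrame({
    "Region": ["North", "North", "South", None],
    "Value": [1.0, 1.0, np.nan, 4.0],
})

print(raw.isna().sum())        # number of NAs in each column
print(raw.duplicated().sum())  # number of rows that repeat an earlier row

clean = (
    raw.drop_duplicates()           # drop repeated rows
       .dropna()                    # drop rows with any missing value
       .rename(columns=str.lower)   # lower-case the column names
)
clean["region"] = clean["region"].str.lower()  # lower-case the text values
print(clean)
```

Dropping rows is only one option for handling NAs; whichever you choose, say why in your comments.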

***Note:*** this should take you several steps. Make sure to leave all of the code and results in place for grading (this is also good practice for reproducibility)  

***write and run your code below***

In [None]:
## look for NAs and duplicates
## clean the data to look like the current gapminder



### _finally_, combine the data sets

Now that the two DataFrames have the same columns and overall formats, you can combine them.  
Check out this [link to the docs](http://pandas.pydata.org/pandas-docs/stable/merging.html) for recommendations; there are several ways to do this.  

**You have 2 jobs in this section:**  
+ merge the 2 DataFrames calling the merged version `gapminder_comb` and confirm it is merged
+ make sure the index of `gapminder_comb` is continuous (doesn't start over creating duplicates)
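The index issue in the second job is easy to see on toy data. In this sketch (the names `first`, `second`, `naive`, `combined`, and `also_fixed` are hypothetical), a plain `pd.concat()` keeps each frame's original row labels, and two common ways of getting a continuous index are shown.

```python
import pandas as pd

first = pd.DataFrame({"n": [1, 2]})
second = pd.DataFrame({"n": [3]})

# Plain concatenation keeps the original labels, so 0 appears twice
naive = pd.concat([first, second])
print(naive.index.tolist())

# Option 1: ask concat to build a fresh 0..n-1 index
combined = pd.concat([first, second], ignore_index=True)
print(combined.index.tolist())

# Option 2: renumber afterwards, discarding the old labels
also_fixed = naive.reset_index(drop=True)

# One way to verify that no label is repeated
print(combined.index.is_unique)
```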

***write and run your code below***

In [3]:
## combine the data sets
## fix and verify index is continuous



**2\. Comment on your results.**

## part3: summarizing data

### Exploration is an iterative process

So far, we've taken raw data and worked through steps to prepare it for analysis, but we have not yet done any "data analysis".  This part of the data workflow can be thought of as "exploratory data analysis", or EDA.  Many of the steps we've shown are aimed at uncovering interesting or problematic things in the dataset that are not immediately obvious.  We want to stress that when you're doing EDA, it will not necessarily be a linear workflow.  When you plot or summarize your data, you may uncover new issues: we saw this when we made a mistake fixing the naming conventions for the Democratic Republic of Congo.  You might discover outliers, unusually large values, or points that don't make sense in your plots.  Clearly, the work here isn't done: you'll have to investigate these points, decide how to fix any potential problems, document the reasoning for your actions, and check that your fix actually worked.

On the other hand, plots and summaries might reveal interesting observations and prompt questions about your data.  You may return to the cleaning and prepping steps in order to dig deeper into these questions.  You should continuously refine your plots to give the clearest picture of your hypotheses. Remember, **exploration is an iterative process**.

Remember that the `info()` method gives a few useful pieces of information, including the number of rows and columns in the DataFrame, the variable type of each column, and the amount of memory used. We can see many of our changes (continent and country columns instead of region, higher number of rows, etc.) reflected in the output of the `info()` method.

We also saw before that the `describe()` method will take the numeric columns and give a summary of their values. 
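As a quick refresher on what these two methods return, here is a sketch on an invented DataFrame (`toy`, `name`, `height`, and `weight` are hypothetical names).

```python
import pandas as pd

toy = pd.DataFrame({
    "name": ["a", "b", "c"],
    "height": [1, 2, 3],
    "weight": [0.5, 1.5, 2.5],
})

# info() prints row count, column names, dtypes, and memory usage
toy.info()

# describe() returns a DataFrame of summary statistics;
# by default only the numeric columns are included
summary = toy.describe()
print(summary)
print(summary.loc["mean", "height"])
```

Note that the text column `name` does not appear in the `describe()` output.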

***write and run code*** to get `info()` and `describe()` the `gapminder_comb` data.

In [None]:
## info and describe



The two methods above are very easy to use, but don't give us all of the info that we might want.

### More summaries

Now play around with some of the summary functions that you used before or are on the cheat sheet.
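A few summary calls you might try are sketched below on toy data. The DataFrame `toy` and its columns are made up, and these are only some of the options on the cheat sheet.

```python
import pandas as pd

toy = pd.DataFrame({
    "group": ["a", "a", "b"],
    "value": [10, 30, 20],
})

n_groups = toy["group"].nunique()     # how many distinct groups
counts = toy["group"].value_counts()  # rows per group
top = toy.loc[toy["value"].idxmax()]  # the row holding the largest value

print(n_groups)
print(counts)
print(top)
```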

In [None]:
## your code and results for summaries


What if we wanted a new DataFrame that just contained summary info? This could be a table in a report. For example, what if we wanted to know the number of countries per continent, the mean and median population for a country on each continent, and the mean and median GDP for a country on each continent? (just the blank skeleton here):

continent|countries|meanpop|medianpop|meangdp|mediangdp
---|---|---|---|---|---
**africa**|||||
**americas**|||||
**asia**|||||
**europe**|||||
**oceania**|||||

write code to reproduce this table (with the numbers filled in)
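One possible approach to a table of this shape is `groupby()` followed by named aggregation, sketched here on invented data. The frame `toy`, its columns, and the output column names are hypothetical; this shows the technique rather than the expected answer.

```python
import pandas as pd

# Each item can appear in several rows, as a country does across years
toy = pd.DataFrame({
    "group": ["a", "a", "a", "b", "b"],
    "item": ["p", "p", "q", "r", "r"],
    "size": [10, 20, 60, 5, 15],
})

# Named aggregation: new_column=(source_column, function)
report = toy.groupby("group").agg(
    items=("item", "nunique"),      # distinct items, not number of rows
    meansize=("size", "mean"),
    mediansize=("size", "median"),
)
print(report)
```

Because items repeat across rows, counting rows would overstate the number of items, which is why `nunique` is used here.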

In [4]:
## report table code


**3\. Comment on your results.**

## part4: data visualization with `matplotlib`

Recall that [matplotlib](http://matplotlib.org) is Python's main visualization library. It provides a range of tools for constructing plots, and numerous high-level plotting libraries are built with `matplotlib` in mind. When we were in the early stages of setting up our analysis, we loaded these libraries like so:

In [7]:
## don't forget to import and nickname
import matplotlib.pyplot as plt

Now, let's turn to data visualization. Visualization is key to getting a feel
for the properties of the data set we are working with. While we will
focus only on the essentials of how to properly construct plots in univariate
and bivariate settings here, it's worth noting that matplotlib
supports a wide variety of plots. Check out the gallery for examples: [matplotlib gallery](http://matplotlib.org/gallery.html).

---

### Single variables

**Histograms** - provide a quick way of visualizing the distribution of numerical data, or the frequencies of observations for categorical variables. (***run the code*** in the cell below)...

In [None]:
## example histogram

# import numpy to get random numbers
import numpy as np

# generate some random numbers from a normal distribution
data = 100 + np.random.randn(500)

# make a histogram with 20 bins
plt.hist(data, 20)
plt.xlabel('x axis')
plt.ylabel('y axis')
plt.show()

### Pairs of variables

**Scatterplots** - provide visualization of relationships across two variables, for example (***run the code*** in the cell below)...

In [None]:
## example scatterplot

# import numpy to get random numbers
import numpy as np

# generate toy data
x = 2 * np.random.randn(55) + 5
y = 3 + 0.5 * x + np.random.randn(55)

# plot the data
plt.plot(x, y, 'bo') # blue dots
plt.show()

Your job for the plotting portion of this assignment is to:
+ make at least 3 plots of the `gapminder_comb` data set
+ at least one needs to be a histogram
+ at least one needs to be a scatterplot
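The examples above plot NumPy arrays, but DataFrame columns can be passed to the same functions. Here is a minimal sketch using a made-up DataFrame; `toy`, `x`, and `y` are hypothetical names, and the axis labels are placeholders for descriptions of your own variables.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Toy data standing in for two numeric columns
rng = np.random.default_rng(0)
toy = pd.DataFrame({"x": rng.normal(5, 2, 200)})
toy["y"] = 3 + 0.5 * toy["x"] + rng.normal(0, 1, 200)

# Histogram of a single column; hist() also returns the bin counts and edges
counts, edges, _ = plt.hist(toy["x"], bins=15)
plt.xlabel("x (toy units)")
plt.ylabel("count")
plt.title("Distribution of x")
plt.show()

# Scatterplot of one column against another
plt.scatter(toy["x"], toy["y"], alpha=0.5)
plt.xlabel("x (toy units)")
plt.ylabel("y (toy units)")
plt.title("y versus x")
plt.show()
```

Labeled axes and a title make it much easier to comment on what each plot shows.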

***write and run*** your code below. Make sure to leave in the results and comment on each plot.