# Solving Problems with Python: an Introduction to Data Science

## Lesson 3: an introduction to data processing/manipulation

In this lesson we are going to build on what you learned in lesson two (so make sure you've done that!) by analyzing and processing data that we read in from a file. 

Before we get started, familiarize yourself with accessing the Python documentation. Nearly every question you might have probably has an answer there, with code examples to boot! Make sure you know how to access it -- it will be very useful to you in the event I'm not here to answer your question (or you don't want to wait for me to finish helping someone else). 

Now let's learn how to use a new module -- I'm going to show you an alternate way of reading in a file with the ```csv``` module. To use it we first need to import it.

In [1]:
import csv #make sure you run this cell!

Below is an example of how we can use this module to read in data, using our Dow data from the last exercise. 

In [6]:
dowData = []
with open('dow.txt') as f: #use the built-in open function to open the file and give it the name "f"
    reader = csv.reader(f) #define a variable called reader that stores the results of the csv.reader() function
    for row in reader: #this loop goes through every row and does whatever we want on each iteration
        dowData.append(float(row[0])) #add each row (in decimal form) to dowData
        #why the [0]? each "row" is really a 1 item list -- the [0] takes just the number 

### Exercise 1: checking this new method

Check and see if this way of reading in the data produces the same result as in lesson two. Make a plot (**don't forget to import matplotlib!**) to show this. 

In [8]:
#your code goes here:


### Exercise 2: reading in more complicated data

Make sure you have the file ```baseball_players.csv``` in your Jupyter notebook directory. Before we do anything fancy with it, it's a good idea to simply look at the data to see how it's organized. Here's an example of how we could spit out each "row" of data, using the Dow example above:
```python
with open('dow.txt') as f:
    reader = csv.reader(f)
    for row in reader:
        print(row) #notice everything is the same as before except this line
        #instead of saving the data to a list I just want to see it (for now)
```
**Your turn:** Modify this example to examine the ```baseball_players.csv``` file. What do you notice? Try modifying the print statement to something like ```print(row[0])``` or ```print(row[3])```. What does this return?

In [11]:
#your code goes here


You should notice that the first "row" contains header information -- these are like the column names in a spreadsheet. Our data has recorded the names, teams, positions, heights, weights, and ages of 1034 different players across the league.

**STOP!** Are the examples with the loops confusing you? If so (and/or you want some extra practice) open the loops notebook for a more detailed explanation before continuing. 

### Exercise 3: saving our data into useful lists

Now that we know what the data looks like we want to start working with it, but to do that we need to think of a smart way to organize it. There are many different ways to do this, but we will start with the simplest -- saving each column as its own list. Here's an example:

```python 
names = [] #initialize an empty list called names
with open('baseball_players.csv') as f:
    reader = csv.reader(f)
    lineCount = 0
    for row in reader: 
        if lineCount == 0: #first line of file, which I want to skip
            print("skipped the first line because it was just a header")
            lineCount += 1 #add one to lineCount 
        else:
            names.append(row[0]) #the first item in each row (row[0]) is the player's name 
            lineCount += 1 #add one to lineCount
    print("all done --", lineCount, "lines read (including header)")
```
**Your turn:** Modify the above example to get lists for each column in the data (names, teams, positions, heights, weights, and ages). Check that they are all 1034 items long (you can do this using the ```len()``` function, i.e. ```print(len(names))``` should spit out the number 1034).

**Important!** When you read in the numbers (heights, weights, and ages) ***you will need to tell the computer you want this to be a number, not a string***. Recall that we can accomplish this easily by using the built-in ```int()``` and ```float()``` functions (to convert to integers or floats respectively). For example, for the age list we would need to modify the append command to something like ```ages.append(float(row[5]))```.
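
To see the difference between a string and a number in action, here is a small sketch that runs without the real file. The player rows are made up for illustration (they are not from ```baseball_players.csv```), and it skips the header with ```next(reader)``` instead of the ```lineCount``` approach above -- either way works.

```python
import csv

# made-up lines standing in for the file (hypothetical players, not real data)
sampleLines = [
    "Name,Team,Position,Height,Weight,Age",
    "Player A,AAA,Catcher,74,180,22.99",
    "Player B,BBB,Shortstop,72,210,34.69",
]

sampleHeights = []
sampleAges = []
reader = csv.reader(sampleLines)  # csv.reader accepts any iterable of text lines
next(reader)  # skip the header row
for row in reader:
    print(type(row[3]))  # csv hands everything back as text (str)
    sampleHeights.append(int(row[3]))  # whole number, so int() works
    sampleAges.append(float(row[5]))  # has a decimal point, so use float()

print(sampleHeights, sampleAges)
```

If you forget the conversion, something like ```sum(heights)``` will fail later because Python cannot add strings to numbers.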

In [34]:
#your code goes here


### Exercise 4: analyzing and sorting our data

Now that we've organized our data into something more Python friendly we can start processing/manipulating it. Using your lists and Python skills, answer the following questions about this sample of MLB players (ranked in order of increasing difficulty): 


#### A: What is the average height of all players?

**Hint:** Remember the definition of an average is to add every number up and then divide by the total number of items in the list. You can accomplish this easily with the built in ```sum()``` function ie:
```python
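total = sum(myList) #this returns the sum of all values in myList (swap in the name of your own list)
average = total/len(myList) #to get the average we divide by the number of items in the list
#note: avoid naming your own variable "list" -- that would hide Python's built-in list type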
```
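
As a quick check of the idea, here is the same calculation on a tiny made-up list (the numbers are invented for illustration, not taken from the player data):

```python
sampleHeights = [70, 72, 74, 76]  # made-up heights in inches

total = sum(sampleHeights)  # 70 + 72 + 74 + 76 = 292
average = total / len(sampleHeights)  # 292 divided by 4 items

print(average)  # prints 73.0
```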



In [50]:
#your solution to part A here


#### B: Who is the tallest player in the league (according to this data)?

**Hint:** Look back to your lesson two notebook, where we figured out how to find the maximum number and its index. 
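
In case you don't have lesson two handy, here is one possible way to do it (it may differ from the approach used there). The names and heights below are made up; the key idea is that the same index points at the same player in both lists.

```python
sampleNames = ["Player A", "Player B", "Player C"]  # made-up names
sampleHeights = [72, 79, 75]  # made-up heights, in the same order as the names

tallest = max(sampleHeights)  # the largest value in the list
tallestIndex = sampleHeights.index(tallest)  # position of its first occurrence

print(sampleNames[tallestIndex], tallest)  # prints Player B 79
```

Note that ```index()``` only returns the first match, so if two players are tied you will only see one of them.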



In [40]:
#your solution to part B here


#### C: How many different MLB teams are represented in the data?

**Hint:** We need to go through the entire teams list and count how many *unique* entries there are. The easiest approach to tackling this problem is to create a new list to keep track of this -- see example below: 
```python
unique = [] #initialize list to keep track of unique entries
for item in myList: #replace myList with the name of your own list
    if item in unique: #this checks if "item" is in the list "unique"
        continue #don't do anything because we've already tracked this one
    else:
        unique.append(item)
```
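
Once you have the loop version working, you can double check it with Python's built-in ```set()```, which throws away duplicates for you. This is an alternative to the loop, not a replacement for understanding it, and the team codes below are made up:

```python
sampleTeams = ["AAA", "BBB", "AAA", "CCC", "BBB"]  # made-up team codes

uniqueTeams = set(sampleTeams)  # a set keeps only one copy of each value

print(len(uniqueTeams))  # prints 3
```

One caveat: a set does not remember the order items were added in, so stick with the list approach when order matters.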


In [41]:
#your solution to part C here


#### D: Which team has the largest average height?

**Hint:** This is easiest to solve in two steps. First you need to find the average heights of each team individually (this is the hard part) then you need to figure out which is the largest (this is the easy part). 

Consider the following example:

```python
#let's say I want to find the average height of all BAL players...
hSum = 0
playerCount = 0
for i in range(len(heights)): #i will go from 0 to 1033 (one less than the length of the heights list)
    if teams[i] == "BAL": #check if the team at this spot is BAL
        hSum += heights[i] #if this is a BAL player add his height to our sum
        playerCount += 1 #add to playerCount (for use later)
hAverage = hSum/playerCount #divide sum by number of players (see definition of average)
```
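
To extend this from one team to every team, one possible approach is to wrap the same logic in a second loop over the unique team names. The sketch below uses made-up team codes and heights so you can follow the numbers by hand; it is one way to structure the solution, not the only one.

```python
sampleTeams = ["AAA", "BBB", "AAA", "BBB", "CCC"]  # made-up team codes
sampleHeights = [70, 72, 74, 76, 73]  # made-up heights, same order as the teams

# step 1: collect each team name once
uniqueTeams = []
for team in sampleTeams:
    if team not in uniqueTeams:
        uniqueTeams.append(team)

# step 2: compute one average per team, in the same order as uniqueTeams
teamAverages = []
for team in uniqueTeams:
    hSum = 0
    playerCount = 0
    for i in range(len(sampleHeights)):
        if sampleTeams[i] == team:
            hSum += sampleHeights[i]
            playerCount += 1
    teamAverages.append(hSum / playerCount)

# step 3: find the largest average and look up which team it belongs to
best = teamAverages.index(max(teamAverages))
print(uniqueTeams[best], teamAverages[best])  # prints BBB 74.0
```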

In [42]:
#your solution to part D here


#### E: Which position has the smallest average weight?

**Hint:** This is basically the same exercise as part D, but now you need to find the *smallest* value instead of the largest and you are sorting by position instead of team.

In [43]:
#your solution to part E here


#### F: What is the average age and standard deviation of all baseball players? 

**Hint:** read the numpy documentation and search for the keywords "mean" (AKA average) and "standard deviation" then follow the examples there to solve this! For those unfamiliar the standard deviation is essentially a measure of how tightly packed your data are -- if the standard deviation is small that means most of your data is clustered together, whereas a large standard deviation indicates that your data is much more scattered/has a much wider spread.
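
As a starting point, here is a minimal sketch of the two numpy functions on a small made-up list of ages. It assumes numpy is installed and imported under the usual ```np``` nickname.

```python
import numpy as np

sampleAges = [22.0, 26.0, 30.0, 34.0]  # made-up ages

meanAge = np.mean(sampleAges)  # add everything up and divide by the count
stdAge = np.std(sampleAges)  # by default this divides by N (the "population" version)

print(meanAge)  # prints 28.0
print(stdAge)  # roughly 4.47 for these numbers
```

If you want the "sample" standard deviation (dividing by N-1 instead of N), ```np.std``` accepts a ```ddof=1``` keyword argument.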

In [44]:
#your solution to part F here


### Exercise 5: new kinds of plots

Now that you are masters of manipulating data, you're going to apply those skills to make some fancy new plots. In this exercise you'll need to make both a bar and a scatter plot, but don't worry -- there are examples of how to do both below!

#### Bar example:

```python
import matplotlib.pyplot as plt
categories = ['category 1', 'category 2', 'category 3', ...]
barSizes = [number1, number2, number3, ...]
plt.bar(categories, barSizes)
plt.title("This is a bar plot of the sizes of things in different categories")
plt.xlabel("categories (units)")
plt.ylabel("height of bar (units)")
plt.show()
```
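
The template above has placeholders in it, so it won't run as written. Here is a filled-in version with made-up categories and numbers that you can run as-is to see what a bar plot looks like:

```python
import matplotlib.pyplot as plt

categories = ["apples", "bananas", "cherries"]  # made-up category names
barSizes = [3, 7, 5]  # made-up values, one per category (same order!)

plt.bar(categories, barSizes)
plt.title("Number of each fruit (made-up data)")
plt.xlabel("fruit")
plt.ylabel("count")
plt.show()
```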

#### Scatter example:
```python
import matplotlib.pyplot as plt
xValues = [some data...]
yValues = [some data...]
plt.scatter(xValues, yValues, keyword arguments... (see L2))
plt.title("this is a scatter plot")
plt.xlabel("x (units)")
plt.ylabel("y (units)")
plt.show()
```
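
Again, the template above is not runnable until the placeholders are filled in. Below is a runnable sketch with made-up points; the keyword arguments shown (```s```, ```c```, ```alpha```, ```marker```) are just a few common ones and may not be the same ones covered in L2.

```python
import matplotlib.pyplot as plt

xValues = [1, 2, 3, 4, 5]  # made-up data
yValues = [2.1, 3.9, 6.2, 7.8, 10.1]  # made-up data, same length as xValues

# s = marker size, c = color, alpha = transparency, marker = shape
plt.scatter(xValues, yValues, s=40, c="tab:blue", alpha=0.7, marker="o")
plt.title("this is a scatter plot (made-up data)")
plt.xlabel("x (units)")
plt.ylabel("y (units)")
plt.show()
```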

#### Part A: generating BMI Lists
Body mass index (BMI) is defined as $BMI = \frac{weight \ (in \ kg)}{{height \ (in \ meters)}^{2}}$

To use this formula we first need to convert our data from pounds and inches to kilograms and meters. For reference, there are **$\sim$0.454 kilograms per pound** and **$\sim$0.0254 meters per inch.**
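
Here is the conversion and formula worked through for a single made-up player (180 pounds, 74 inches), using the approximate conversion factors above:

```python
weightLb = 180  # made-up weight in pounds
heightIn = 74  # made-up height in inches

weightKg = weightLb * 0.454  # pounds -> kilograms
heightM = heightIn * 0.0254  # inches -> meters

bmi = weightKg / heightM**2  # weight divided by height squared
print(round(bmi, 1))  # prints 23.1
```

For the exercise you will need to do this for every player, so think about putting these lines inside a loop.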

Using approaches similar to what you've already done in exercise 4, create a list that has each team name (only once) and a corresponding list with each team's average BMI (using the formula above). Then make a bar chart of this data -- which team is the "fittest" according to this BMI calculation? Your "categories" will be the list of team names and your "barSizes" will be the average BMI of each team.

**Important:** to get the right answer you need to make sure the two lists you create match each other -- i.e. if the first team in your team list is "BAL" then the first number in your BMI list needs to be the average BMI for "BAL".

In [47]:
#your solution to part A here


#### Part B: plotting BMI vs age

Now we are going to look at the BMI of every player and compare it with their age to see if there appears to be a correlation.

First, create a list that holds the BMI of every player (again the order is important so make sure you keep track of it) -- these will be your "yValues" and you can create them easily using the weights and heights lists you already made. 

Next, make a scatter plot with this new BMI list and your ages list (which are your "xValues") and display the result. Refer to the image in L2 to make adjustments to enhance the quality of your plot.

Does there appear to be a correlation between age and BMI?

**Challenge:** Plot the data as before but find a way to make the colors of each point change according to what team they are on and report this in a legend.

In [48]:
#your solution to part B here
