# Solving Problems with Python: an Introduction to Data Science

## Lesson 2: interacting with data

In this lesson you're going to read in some sample data from a text file, then make a plot! Some of the code samples I've given you below use things we haven't talked about yet, but don't worry--you'll understand it all soon! For now, just make sure you can get them to work properly.

### Exercise 1: reading in data from a .txt file

There are many different ways to do this (including several modules we can import that do it automatically), but we are going to start with a simple example. In the cell below I am providing you with a simple function called ```getListFromFile``` which takes the path to a file and returns the file's lines as a list for us to analyze.

Before you can use it you need to make sure you have the dow.txt file located in the same place this notebook is running from (or you need to know the explicit filepath to this file). 

In [1]:
def getListFromFile (inputFile): #input file is the path to the file you want to access, if you put the 
                            #file in the same place as this notebook you just need to enter the filename
    dataList=[] #this initializes an empty list/array
    try: #we use try here in case there is an error (see except) -- don't worry too much about this now
        source=open(inputFile,"r") #open the file
        dataList=source.readlines() #read the file
        source.close() #close the file
    except FileNotFoundError: #if try fails return a value of -1 to indicate error
        return -1
    return dataList #if try goes well we spit out the resulting data
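
As an aside, here is a minimal sketch of the same idea written with a ```with``` block, which closes the file for us automatically. The function name ```readLinesFromFile``` and the filename are made up for illustration and are not part of this lesson's files. It also shows how you might check for the -1 error value before using the result.

```python
def readLinesFromFile(inputFile): #hypothetical name, same idea as getListFromFile
    try:
        with open(inputFile, "r") as source: #the file is closed automatically when this block ends
            return source.readlines()
    except FileNotFoundError:
        return -1 #same error signal as the function above

result = readLinesFromFile("a_file_that_probably_does_not_exist.txt") #made-up filename
if result == -1:
    print("Couldn't find that file -- check the filename/path!")
else:
    print("Read", len(result), "lines")
```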

Make sure to run the cell above, then run the cell below to use this function to read in the data from the dow.txt file.

In [7]:
rawData = getListFromFile('dow.txt')

**Check:** Does this look right? Open up the text file in a text editor (should happen automatically if you double click on it from the file browser) and compare what's stored in ```rawData``` to what you see in the text file. Are they the same? 

To display the contents of ```rawData``` in this notebook simply run the following command in the cell below:
```python 
print(rawData)
```

In [None]:
#your code goes here


You should notice that the values stored in ```rawData``` look almost right, but each number in the Python version is followed by a funky extra character that you probably don't see when looking at the file in a text editor. This trailing special character (```\n```) tells the computer you want to go to a new line. You can confirm this behavior by running the following command in the cell below:
```python
print("hello \nworld")
```
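
If you want to see the hidden character explicitly, one option is Python's built-in ```repr()``` function, which shows a string the way Python stores it. The sketch below uses a made-up string (```sampleLine```) rather than the real data:

```python
sampleLine = "hello\n" #made-up example string ending in a newline
print(repr(sampleLine)) #shows 'hello\n' with the newline visible
print(len(sampleLine)) #6 -- the newline counts as one character
print(repr(sampleLine.strip())) #strip() removes leading/trailing whitespace: 'hello'
```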

In [11]:
#try it!


This is a *small* problem for us, because we want just the numbers. Computers treat different types of data differently, and right now the computer thinks that what we have in ```rawData``` is a bunch of text, ***not*** a bunch of numbers. We can check this with the ```type()``` function on a sample from the list. 

Run this code in the cell below:
```python 
type(rawData[0]) #the [0] means we are taking the first item from the list--Python starts counting from 0
```

In [14]:
#try it!


```str``` stands for string, which is computer-speak for text. This is easy for us to fix! We simply need to go through the list and convert each item to a decimal number, which is called a **float** in computer science. In Python we accomplish this by writing ```float(number_you_want_to_convert)```.
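
Here's a small sketch of what ```float()``` does to a single piece of text. The string below is a made-up sample, not a line from dow.txt. Notice that ```float()``` ignores the trailing ```\n```, so we don't need to remove it ourselves:

```python
sampleText = "1234.5\n" #made-up sample text with a trailing newline
sampleNumber = float(sampleText) #convert the text to a decimal number
print(sampleNumber) #1234.5
print(type(sampleText)) #<class 'str'>
print(type(sampleNumber)) #<class 'float'>
```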

In the cell below I've written some code that accomplishes this with a **for loop**. This is an important topic we will discuss more later, but for now just run it and investigate the resulting ```dowData``` list to make sure that it worked!

In [15]:
dowData = [] #initialize an empty array for us to collect the numbers in 
for i in range(len(rawData)): #this means we are going to do the actions below to everything in the list
    dowData.append(float(rawData[i])) #take everything from the rawData list, turn it into a decimal, 
                                     #then add it to our dowData list
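
For the curious: Python also lets you write this kind of loop in a single line with a **list comprehension**. This is just an alternative sketch on made-up sample values (```sampleLines``` is not the real data); the loop above works perfectly well.

```python
sampleLines = ["1.5\n", "2.25\n", "3.0\n"] #made-up lines, like what readlines() gives us
sampleNumbers = [float(line) for line in sampleLines] #convert every item in one line
print(sampleNumbers) #[1.5, 2.25, 3.0]
```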

In [18]:
#print the dowData list and use type() on an item in the list to make sure that the conversion was 
#successful (it should spit out float and not str)
#your code here:


### Exercise 2: an introduction to plotting!

So we have some data, but now what? As humans we like to look at things, so the next logical step is for us to figure out a way to display this dataset in graphical form. Luckily for us there is a robust plotting library for Python that can automagically do a lot of what we want. It's called matplotlib! It isn't part of Python itself, but it comes with most scientific Python setups. Run the cell below to import it and get it set up for our notebook.

In [22]:
import matplotlib.pyplot as plt #the as plt gives matplotlib.pyplot the name plt, so that we don't have
                                #to type matplotlib.pyplot every time we want to use it.
%matplotlib inline 
#the line above helps us display the plots easier/better in our Jupyter notebook

We can get started right away by simply telling matplotlib to show us the data (no frills attached). To do this run ```plt.plot(dowData)``` in the cell below:

In [24]:
#your code goes here: 


#### Formatting:
This isn't a very pretty plot right now. Any true scientist knows that you always need to label your axes and have a proper title at the very least! Luckily for us these commands are very easy, and below I've added the ones you'll need to fix these problems.

To add a title: ```plt.title("your title here")```

To add a label to the x-axis: ```plt.xlabel("your label here")```

To add a label to the y-axis: ```plt.ylabel("your label here")```

To explicitly show the plot (if you don't like the Out[#] nonsense): ```plt.show()```

Each of these commands should go on its own line, i.e.
```python
plt.plot(myData)
plt.title("Here's a plot of some data I have!")
plt.xlabel("x axis (units)")
plt.ylabel("y axis (units)")
plt.show() #not required since we have %matplotlib inline, but still good to know
```

**Your turn:** Make a new plot of the data with proper labels and a title. The data you are plotting is the daily closing value of the Dow Jones Industrial Average Index (but I'm not telling you over what time period it occurs, as you're going to figure this out yourself soon...). 

In [26]:
#your code goes here


Right now we are just plotting "y" values and matplotlib is assuming what our x values are automatically (just numbering them from 0 up to one less than the number of data points we have). What if we want to change this?
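
If you'd like to check which x values matplotlib picked, you can ask the plotted line for them. This is a sketch using three made-up y values; ```get_xdata()``` is a method of the line object that ```plt.plot()``` returns.

```python
import matplotlib.pyplot as plt

sampleY = [10, 20, 30] #made-up y values
lines = plt.plot(sampleY) #plt.plot returns a list of the lines it drew
defaultX = lines[0].get_xdata() #the x values matplotlib assumed for us
print(list(defaultX)) #the counting starts at 0, not 1
```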

Let's do this now with the help of another module called ```numpy```. Import it by running the following code in the cell below:
```python
import numpy as np
```

In [27]:
#import numpy here


```numpy``` has a nice built-in function we are going to use called ```linspace()``` that allows us to easily create a range of numbers automatically. Here's an example of how ```linspace()``` works:
```python
xValues = np.linspace(startValue, endValue, length) #startValue is where the list/array starts, endValue 
#where it ends, and length is the number of items the resulting list/array will have.
```
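
Here's a concrete sketch with made-up numbers so you can see what comes out: five evenly spaced values from 0 to 10. Note that both the start and the end value are included in the result.

```python
import numpy as np

sampleX = np.linspace(0, 10, 5) #start at 0, end at 10, 5 items total
print(sampleX) #the values are 0, 2.5, 5, 7.5 and 10
print(len(sampleX)) #5
```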

**Your turn:** Create an array of x values that **starts at 0 and ends at 100** (to represent percentage of time elapsed) using the ```linspace()``` function. The value you pass in for length should be **the same** as the length of the y values (you can find this out by running ```len(dowData)```).

In [28]:
#your code goes here


### Exercise 3: more plotting

Now that you have lists (of the same size) for both x and y, let's learn how to plot them together and alter the look. To plot two items together using only default values, you can simply run ```plt.plot(xValues, yValues)``` but what if we want to change the color/type of the line, add a label, or otherwise modify it?

Here's a more complicated example (you should run it in a cell below and see the output to figure out what each part does):

```python
plt.plot(xCreatedAbove, dowData, "b--", label = "daily closing value")
plt.title("Value of the Dow Jones Industrial Average Index Over Time")
plt.xlabel("% time elapsed")
plt.ylabel("points")
plt.legend()
plt.show()
```

**Your turn:** Modify your Dow plot to use the new x axis data points you just created, update the x label, and futz around with colors/labels/marker styles. 

In [42]:
#your code goes here:


There are a lot of different options you can pass in to plots, so here's a good cheat sheet you can refer to.

**Note:** if the image is not displaying, the file is not located in the same directory as this notebook. Locate the file "matplotlib-cheatsheet.png" and move it to your notebook directory and it will appear!
<img src="matplotlib-cheatsheet.png">

Most of these things can be passed in as keyword arguments. For example, to explicitly plot a turquoise line we can write the following:
```python
plt.plot(x, y, color = "turquoise") 
#this also works
plt.plot(x, y, c = "turquoise")
```

**Your turn:** Modify your plot by trying at least one new parameter from the cheatsheet above!

In [43]:
#your code goes here


### Exercise 4: analyzing the data

In the great recession that happened ~a decade ago the stock market dropped precipitously, bottoming out on March 9, 2009. Using this fact and our data above, you are going to figure out what dates this data starts and ends! Recall that each data point was taken one day apart, so if we can find the lowest one (and we know the lowest happened on 3/9/2009) we can do some simple math (Python is a great calculator!) to figure out when the data starts and stops. 

The easiest way to find the location of the lowest data point is to simply check every point, and if it's the lowest then we will spit it out! In Python we can easily get the value of the lowest point by calling the built-in ```min()``` function. Run the cell below and print out the value to see how it works:

In [47]:
lowestValue = min(dowData) #the correct value is 6547.05 -- make sure you get this!
#now print the value to check:


This is nice, but we need to know *where* the lowest value is, not just what the number is. To find where it is we will use a loop -- specifically a **for loop**. Simply put, this loop will execute **for the number of times you tell it to**. Here's an example:

In [48]:
for i in range(1,5):
    print(i)

1
2
3
4


This loop creates a variable called ```i``` that starts as 1. Each iteration of the loop prints the value of ```i``` and then moves ```i``` on to the next number in the range. ```range(1,5)``` stops *before* 5, which is why the loop only prints up to 4 -- once it would reach 5 the loop is finished. We could have used any letter/variable for ```i```, but this is the standard.

**Your turn:** Run the cell above again but change something! Make the range bigger/smaller, change ```i``` to ```j``` or ```unicorns``` or whatever else you might want, etc.
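
If you want to check what a particular ```range()``` will produce before looping over it, you can turn it into a list and print it. A quick sketch:

```python
print(list(range(1, 5))) #[1, 2, 3, 4] -- starts at 1, stops before 5
print(list(range(5))) #[0, 1, 2, 3, 4] -- with one number, counting starts at 0
print(list(range(0, 10, 2))) #[0, 2, 4, 6, 8] -- the third number is the step size
```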

**Notice the syntax of the for loop:**

1. It requires there to be something to iterate (```i``` in our case).

2. It must have a range to iterate through (in our case 1 to 5).

3. It's a special kind of statement (not a function), so the line with the for statement **must end with a colon (:)**.

4. There must be something indented inside the loop that will be executed each iteration.

Recall that to get the "ith" item of a list/array in Python, we simply write ```myList[i]```. For example, if we wanted to get the first item (```i = 0```) we would run ```myList[0]```. 

What if we want to check and see if that item's value matches something? We do this with an **if statement**. Here's a sample if statement continuing with our list example:

```python
if myList[0] == 0: #we use == to check the value of something, just one = assigns value to a variable
    print("the first number in this list is 0") #this is what we want our code to do if it equals 0
else:
    print("the first number in this list is not 0") #this is what we want our code to do if it doesn't
```

**Notice the syntax of the if statement:**

1. There must be a condition you are checking (in our example we want to see if the first item of the list is zero).

2. There must be something indented inside the if statement that will execute if the condition is true (in our case if the first item is zero then we print a message saying so).

3. Optionally we can add an **else** statement -- this is what we want to do if the condition **is not** true. Rule number 2 applies to the else statement as well. If you have more than two options in mind you can use **elif condition:** (syntax the same as if but with elif in front) to add more choices for the program. 

4. It's a special kind of statement (like the for loop) so don't forget to end each if/elif/else line **with a colon (:)**.
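
Here's a small sketch of an if/elif/else chain in action. The variable ```sampleValue``` and its value are made up for illustration:

```python
sampleValue = 7 #made-up number, try changing it!
if sampleValue < 0: #checked first
    message = "negative"
elif sampleValue == 0: #only checked if the first condition was false
    message = "zero"
else: #runs if none of the conditions above were true
    message = "positive"
print("the number is", message) #the number is positive
```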

Here's an example of how we could find the location of the maximum value of a list by combining a for loop and an if statement.
```python
maxVal = max(myList) #this is the biggest number in the list (myList is whatever list you are searching)
for i in range(0, len(myList)): #i will go from 0 to one less than the number of items in my list
    if myList[i] == maxVal: #this statement checks if the ith element is equal to the maximum value
        print("The index of the biggest number in this list is:",i)
        break #this exits the for loop -- there's no need for us to keep going once we found it!
        #there's no else statement for this if statement because I only care about finding the value...
        #if it doesn't find the value I just want it to keep going until it succeeds!
```
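
As a side note, Python lists also have a built-in ```index()``` method that returns the position of the first item matching a value, so a search like the one above can be written more compactly. This is an alternative sketch on a made-up list (```sampleList```); writing the loop yourself is still great practice.

```python
sampleList = [3, 9, 4, 9, 1] #made-up numbers
maxVal = max(sampleList) #the biggest number in the list (9)
maxIndex = sampleList.index(maxVal) #position of the first 9
print("The index of the biggest number in this list is:", maxIndex) #prints index 1
```
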
**Your turn:** Using the information/examples in this cell, find both the minimum value encoded in our ```dowData``` list ***and where it's located***. Use this information to figure out what date the data starts and ends at! For example, if you find that the minimum is at index 8 (the 9th item in the list) then the data would start on March 1st, 2009. Don't forget that months have different numbers of days! You can use Python to help in your calculations; it can work pretty much like a normal calculator!
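
If counting days by hand gets tedious, Python's standard-library ```datetime``` module can do the date math. We haven't covered it in this lesson, so treat this as a sketch of one possible approach rather than the expected solution. It steps back 8 days from March 9, 2009:

```python
from datetime import date, timedelta

lowestDate = date(2009, 3, 9) #the day the market bottomed out
daysBefore = 8 #example number of days to step back
earlierDate = lowestDate - timedelta(days=daysBefore) #subtracting a timedelta moves the date back
print(earlierDate) #2009-03-01
```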

Once you know this, modify your plot title or x axis to include the relevant information about what timespan the data is over.

**Challenge 1:** Add markers to the plot that indicate the start, stop, and lowest points on the graph -- refer to the matplotlib cheat sheet.

**Challenge 2:** Write a specialized function that will automatically figure out the date of an arbitrary number of days before or after 3/9/2009. 

In [None]:
#your code goes here
