# Working with Data

We will learn how to work with data at a bare-metal
level, using just raw Python.  Later we will learn to use
(and to appreciate) powerful data manipulation packages
like Pandas.  As you will see, a good deal of effort
has to be put into "cleaning" the data and bringing
it into the right form.

The first task is to read data from a file.  The file
is "global_temperature_anomaly.csv", stored in the "data"
directory.  We begin by reading it into a string:

In [None]:
def read_file(file_name):
    # The with-statement closes the file even if reading fails.
    with open(file_name) as file:
        return file.read()

data_as_string = read_file('data/global_temperature_anomaly.csv')
len(data_as_string)

In [None]:
data_as_string[:600]

# Note the use of slices to view just part of the data.

In [None]:
# Ugh! Very hard to read. Let's split the data into a list of lines.

def string_to_lines(data):
    return data.split("\n")

data_as_lines = string_to_lines(data_as_string)

data_as_lines[-10:]

In [None]:
# We need to get rid of lines that don't have data in them.
# To do this we use list slices.  We will omit the first
# seven lines as well as the very last line:

good_lines = data_as_lines[7:-1]  

good_lines[:4]

In [None]:
# OK, this is looking much better!  But there is one more 
# step.  We have to split each line into the year and the 
# temperature anomaly:

data = list(map(lambda x: x.split(","), good_lines))

data[:4]

In [None]:
# Once we have done this, we can separate the
# years from the anomalies and convert the strings to
# floating point numbers:

years = list(map(lambda x: float(x[0]), data))
anomalies = list(map(lambda x: float(x[1]), data))

years[:3], anomalies[:3], len(years), len(anomalies)
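The cleaning steps above (split into lines, slice off the non-data lines, split each line on the comma, convert to floats) can be tried on a small in-memory string. This is only a sketch: `sample_text` and its values are made up for illustration and are not taken from the real file, and it assumes two header lines rather than seven.

```python
# Hypothetical sample: two header lines followed by "year,anomaly" rows.
# The numbers are invented for illustration; they are not the real data.
sample_text = "Title line\nUnits line\n1900,-0.10\n1901,0.05\n1902,0.20\n"

lines = sample_text.split("\n")   # the trailing "\n" produces a final empty string
rows = lines[2:-1]                # drop the 2 header lines and the empty last line
pairs = [row.split(",") for row in rows]

sample_years = [float(p[0]) for p in pairs]
sample_anomalies = [float(p[1]) for p in pairs]

print(sample_years)
print(sample_anomalies)
```

The empty string at the end of `lines` is the reason the slice stops at `-1`: calling `float` on it would raise a `ValueError`.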

In [None]:
# And now we can make a plot of the data:

from matplotlib import pyplot as plt
%matplotlib inline

plt.plot(years, anomalies)
plt.title("Temperature anomaly")
plt.ylabel("Degrees C")
plt.xlabel("Year")
plt.savefig('anomalies.png')



In [None]:
# We are now going to smooth the data using 
# moving averages.  This is a way to better
# see long term trends unobscured by short-term
# variation

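def smooth(data, window):
    # Average each run of `window` consecutive values.
    output = []
    n = len(data)
    for k in range(0, n - window + 1):
        segment = data[k:(k + window)]
        value = sum(segment) / window
        output.append(value)
    return output

def drop_window(data, window):
    # Drop the first window-1 entries so the result has the
    # same length as the smoothed data.
    return data[window - 1:]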
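A small worked example may help show why the smoothed list is shorter than the input and why the first `window - 1` entries are dropped from the other lists. This is a sketch that uses a hypothetical helper, `moving_average`, which computes the same thing as `smooth` with a list comprehension.

```python
def moving_average(values, window):
    """Average each run of `window` consecutive values."""
    n = len(values)
    return [sum(values[k:k + window]) / window for k in range(n - window + 1)]

values = [1, 2, 3, 4, 5]
window = 3

averaged = moving_average(values, window)   # one average per full window
aligned = values[window - 1:]               # entries that line up with the averages

print(averaged)
print(aligned)
print(len(averaged) == len(aligned))
```

With five values and a window of three there are only three full windows, so each average is paired with the last value in its window.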

In [None]:
window = 10
years2 = drop_window(years, window)
anomalies1 = drop_window(anomalies, window)
anomalies2 = smooth(anomalies,window)

print(len(years2), len(anomalies1), len(anomalies2))

plt.plot(years2, anomalies1, color='red', linestyle='solid')
plt.plot(years2, anomalies2, color='blue', linestyle='solid')
plt.title("Temperature anomaly")
plt.ylabel("Degrees C")
plt.xlabel("Year")

# Save before showing: once the figure has been shown,
# savefig would write out an empty image.
plt.savefig('smoothed_anomalies.png')
plt.show()

In [None]:
# Let's find the line which "best fits" the data.
# The function np.polyfit will give us its coefficients
# (slope first, then intercept).

import numpy as np

m, b = np.polyfit(years2, anomalies1, 1)
m, b
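For a polynomial of degree 1, `np.polyfit` performs a least-squares fit of a straight line. In keeping with the raw-Python theme, here is a minimal sketch of that calculation using the standard closed-form formulas. The function name `fit_line` and the sample points are hypothetical, chosen so that the points lie exactly on y = 2x + 1.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for the points (xs[i], ys[i])."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance of x and y divided by the variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    # The fitted line always passes through the point of means.
    intercept = mean_y - slope * mean_x
    return slope, intercept

xs = [0, 1, 2, 3]
ys = [1, 3, 5, 7]          # exactly y = 2x + 1
slope, intercept = fit_line(xs, ys)
print(slope, intercept)
```

On real data the points do not lie exactly on a line, and the fit returns the slope and intercept that minimize the sum of squared vertical distances to the points.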

In [None]:
# Using these coefficients, we define a function y(x)
# which describes the line:

def y(x):
    return m*x + b

# And we can use it to make a prediction of the temperature
# anomaly in 2040:

y(2040)

In [None]:
# To see whether this was a good prediction,
# let's graph the line against the data. First,
# the data for the line:

linfit = list(map(y, years2))

In [None]:
# Now we can plot the data together with the line:

plt.plot( years2, anomalies1, color='red', linestyle='solid')
# plt.plot( years2, anomalies2, color='blue', linestyle='solid')
plt.plot( years2, linfit, color='green', linestyle='solid')
plt.title("Temperature anomaly")
plt.ylabel("Degrees C")

plt.show()


In [None]:
# (1) Why is our line of best fit misleading?
# (2) Why is our predicted temperature anomaly clearly flawed?
# (3) How might we make a better prediction?

In [None]:
a = [2,3, -9, 0, 55, -17]
list(filter(lambda x: x > 0, a))