## Get the Data

Either use the provided .csv file or, optionally, get the latest data by running an SQL query on StackExchange.

Follow this link to run the query on [StackExchange](https://data.stackexchange.com/stackoverflow/query/675441/popular-programming-languages-per-over-time-eversql-com) and download your own .csv file:

<code>
select dateadd(month, datediff(month, 0, q.CreationDate), 0) m, TagName, count(*)
from PostTags pt
join Posts q on q.Id=pt.PostId
join Tags t on t.Id=pt.TagId
where TagName in ('java','c','c++','python','c#','javascript','assembly','php','perl','ruby','visual basic','swift','r','object-c','scratch','go','delphi')
and q.CreationDate < dateadd(month, datediff(month, 0, getdate()), 0)
group by dateadd(month, datediff(month, 0, q.CreationDate), 0), TagName
order by dateadd(month, datediff(month, 0, q.CreationDate), 0)
</code>
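
To see what the query computes, here is a rough pandas sketch of the same idea: truncate each creation date to the first day of its month, then count rows per month and tag. The `posts` frame and its values are made-up toy data standing in for the joined StackExchange tables, not real results.

```python
import pandas as pd

# Hypothetical toy rows standing in for the PostTags/Posts/Tags join.
posts = pd.DataFrame({
    "CreationDate": pd.to_datetime(
        ["2008-08-03", "2008-08-17", "2008-09-01", "2008-09-20"]
    ),
    "TagName": ["python", "python", "python", "java"],
})

# Truncate each timestamp to the first day of its month,
# which is what the dateadd/datediff expression does in the SQL.
posts["m"] = posts["CreationDate"].dt.to_period("M").dt.to_timestamp()

# Count the rows in each (month, tag) group, like count(*) with group by.
monthly_counts = posts.groupby(["m", "TagName"]).size().reset_index(name="count")
print(monthly_counts)
```

Each output row corresponds to one line of the .csv file: a month, a tag, and the number of posts with that tag in that month.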

## Import Statements

In [None]:
import pandas as pd

## Data Exploration

**Challenge**: Read the .csv file and store it in a Pandas dataframe

In [None]:
df = pd.read_csv("./QueryResults.csv", header=0)
df.columns = ['DATE', 'TAG', 'POSTS']

**Challenge**: Examine the first 5 rows and the last 5 rows of the dataframe

In [None]:
df.head()

In [None]:
df.tail()

**Challenge:** Check how many rows and how many columns there are. 
What are the dimensions of the dataframe?

In [None]:
shape = df.shape
dimensions = df.ndim
print(f"Shape: {shape}\nDimensions: {dimensions}")

**Challenge**: Count the number of entries in each column of the dataframe

In [None]:
df.count()

**Challenge**: Calculate the total number of posts per language.
Which programming language has had the highest total number of posts of all time?

In [None]:
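# Sum only the POSTS column (summing DATE strings is meaningless) and list the largest totals first
df.groupby("TAG")["POSTS"].sum().sort_values(ascending=False)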

Some languages are older (e.g., C) and other languages are newer (e.g., Swift). The dataset starts in September 2008.

**Challenge**: How many months of data exist per language? Which language had the fewest months with an entry? 


In [None]:
df.groupby("TAG")["DATE"].count()

## Data Cleaning

Let's fix the date format to make it more readable. We need to use Pandas to convert each date from a string such as "2008-07-01 00:00:00" to a datetime object with the format "2008-07-01".

In [None]:
df["DATE"].iloc[1]

In [None]:
type(df.DATE[1])

In [None]:
df["DATE"] = pd.to_datetime(df["DATE"])
print(type(df["DATE"][1]))
df["DATE"]

## Data Manipulation
Can you pivot the df DataFrame so that each row is a date and each column is a programming language? Store the result under a variable called ```reshaped_df```. 


In [None]:
reshaped_df = df.pivot(index="DATE", columns="TAG", values="POSTS")
reshaped_df
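
The pivot step can be sketched on a small toy frame. The `toy_df` data below is invented for illustration, with the same column names as `df`. Note how a language with no row for a given date ends up with a NaN cell in the wide table, so its column has fewer entries.

```python
import pandas as pd

# Hypothetical long-format data: "swift" has no row for 2008-08-01.
toy_df = pd.DataFrame({
    "DATE": pd.to_datetime(
        ["2008-08-01", "2008-08-01", "2008-09-01", "2008-09-01", "2008-09-01"]
    ),
    "TAG": ["java", "python", "java", "python", "swift"],
    "POSTS": [10, 5, 12, 7, 1],
})

# One row per date, one column per tag; missing combinations become NaN.
toy_wide = toy_df.pivot(index="DATE", columns="TAG", values="POSTS")
print(toy_wide)

# count() skips NaN cells, so the columns can have different counts.
print(toy_wide.count())
```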

**Challenge**: What are the dimensions of our new dataframe? How many rows and columns does it have? Print out the column names and print out the first 5 rows of the dataframe.

In [None]:
reshaped_df_dimensions = reshaped_df.ndim
reshaped_df_shape = reshaped_df.shape
print(f"Reshaped DataFrame Dimensions:  {reshaped_df_dimensions}")
print(f"Reshaped DataFrame Shape: {reshaped_df_shape}")

In [None]:
reshaped_df.head()

In [None]:
reshaped_df.tail()

**Challenge**: Count the number of entries per programming language. Why might the number of entries be different? 

In [None]:
reshaped_df.count()

In [None]:
reshaped_df.columns

In [None]:
# Replace the NaN (Not a Number) cells with zero and apply the change in place.
reshaped_df.fillna(value=0, inplace=True)
reshaped_df.head()

### Check for NaN Values

In [None]:
# True if any NaN value remains anywhere in the DataFrame
reshaped_df.isna().values.any()

## Data Visualisation with Matplotlib


**Challenge**: Use the [matplotlib documentation](https://matplotlib.org/3.2.1/api/_as_gen/matplotlib.pyplot.plot.html#matplotlib.pyplot.plot) to plot a single programming language (e.g., java) on a chart.

In [None]:
import matplotlib.pyplot as plt

In [None]:
# Plotting java
plt.plot(reshaped_df.index, reshaped_df["java"])

## Styling the Chart

Let's look at a couple of methods that will help us style our chart:

* ```.figure()``` - allows us to resize our chart

* ```.xticks()``` - configures our x-axis

* ```.yticks()``` - configures our y-axis

* ```.xlabel()``` - add text to the x-axis

* ```.ylabel()``` - add text to the y-axis

* ```.ylim()``` - allows us to set a lower and upper bound



### To make our chart larger we can provide a width (16) and a height (10) as the ```figsize``` of the figure.
This will make our chart easier to see. But when we increase the size of the chart, we should also increase the fontsize of the ticks on our axes so that they remain easy to read:

In [None]:
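plt.figure(figsize=(16,10))
# Larger tick labels stay readable on the bigger figure
plt.xticks(fontsize=14)
plt.yticks(fontsize=14)
plt.plot(reshaped_df.index, reshaped_df.java)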

**Challenge**: Show two lines (e.g. for Java and Python) on the same chart.

In [None]:
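# Call plot() once per language, each time with the dates on the x-axis
plt.plot(reshaped_df.index, reshaped_df["java"])
plt.plot(reshaped_df.index, reshaped_df["python"])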

### Now we can add labels. Also, we're never going to get less than 0 posts, so let's set a lower limit of 0 for the y-axis with ```.ylim()```.
**Challenge**: Plot both Python and Java together, this time with labels and a legend.

In [None]:
plt.figure(figsize=(16,10)) 
plt.xticks(fontsize=14)
plt.yticks(fontsize=14)
plt.xlabel("Date", fontsize=14)
plt.ylabel("Number of Posts", fontsize=14)
plt.ylim(0, 35000)
plt.plot(reshaped_df.index, reshaped_df.java, label="Java")
plt.plot(reshaped_df.index, reshaped_df.python, label="Python")
plt.legend(loc="upper left")

### What if we wanted to plot all programming languages?


In [None]:
plt.figure(figsize=(16,10)) 
plt.xticks(fontsize=14)
plt.yticks(fontsize=14)
plt.xlabel("Date", fontsize=14)
plt.ylabel("Number of Posts", fontsize=14)
plt.ylim(0, 35000)
for col in reshaped_df.columns:
    plt.plot(reshaped_df.index, reshaped_df[col], label=col, linewidth=3)
# Build the legend once, after every line has been drawn
plt.legend()

## Smoothing out Time Series Data

Time series data can be quite noisy, with a lot of up and down spikes. To better see a trend, we can plot an average of, say, 6 or 12 observations. This is called the rolling mean. We calculate the average within a window of time and then move the window forward by one observation. Pandas has two handy methods built in to work this out: [rolling()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.rolling.html) and [mean()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.core.window.rolling.Rolling.mean.html).
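
As a minimal sketch of how the window moves, here is a rolling mean over a tiny made-up series (`toy_posts` is illustrative only, not part of the dataset). With a window of 3, the first two positions have too few observations to average, so Pandas leaves them as NaN.

```python
import pandas as pd

# Hypothetical monthly post counts
toy_posts = pd.Series([10, 20, 30, 40, 50])

# Average each observation with the two before it, sliding forward one step at a time.
toy_roll = toy_posts.rolling(window=3).mean()
print(toy_roll)
```

The same effect shows up with `window=12` on the real data: the first 11 rows of the rolling DataFrame are NaN, which is why the smoothed lines start a little later on the chart.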

In [None]:
# The window is the number of observations that are averaged
roll_df = reshaped_df.rolling(window=12).mean()
 
plt.figure(figsize=(16,10))
plt.xticks(fontsize=14)
plt.yticks(fontsize=14)
plt.xlabel('Date', fontsize=14)
plt.ylabel('Number of Posts', fontsize=14)
plt.ylim(0, 35000)
 
# plot the roll_df instead
for column in roll_df.columns:
    plt.plot(roll_df.index, roll_df[column], 
             linewidth=3, label=roll_df[column].name)
 
plt.legend(fontsize=16)