## Get the Data

Either use the provided .csv file or, optionally, get fresh data by running an SQL query on StackExchange.

Follow this link to run the query on [StackExchange](https://data.stackexchange.com/stackoverflow/query/675441/popular-programming-languages-per-over-time-eversql-com) and download your own .csv file.


## Import Statements

In [2]:
import pandas as pd 

## Data Exploration

**Challenge**: Read the .csv file and store it in a Pandas dataframe

In [32]:
df = pd.read_csv('QueryResults.csv', header=0, names=['Date', 'Tag', 'Posts'])

**Challenge**: Examine the first 5 rows and the last 5 rows of the dataframe

In [33]:
df.head()

Unnamed: 0,Date,Tag,Posts
0,2008-07-01 00:00:00,c#,3
1,2008-08-01 00:00:00,assembly,8
2,2008-08-01 00:00:00,c,83
3,2008-08-01 00:00:00,c#,506
4,2008-08-01 00:00:00,c++,164


In [34]:
df.tail()

Unnamed: 0,Date,Tag,Posts
2337,2022-08-01 00:00:00,php,3943
2338,2022-08-01 00:00:00,python,22633
2339,2022-08-01 00:00:00,r,4438
2340,2022-08-01 00:00:00,ruby,481
2341,2022-08-01 00:00:00,swift,1784


**Challenge:** Check how many rows and how many columns there are. 
What are the dimensions of the dataframe?

In [35]:
df.shape

(2342, 3)

**Challenge**: Count the number of entries in each column of the dataframe
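
One possible way to approach this challenge is `DataFrame.count()`, which reports the number of non-missing entries in each column. The sketch below is self-contained: it builds a tiny stand-in dataframe from a few of the rows shown in the outputs above rather than reading `QueryResults.csv`, so the counts it prints are for the toy data only.

```python
import pandas as pd

# Toy stand-in for QueryResults.csv (a few rows copied from the outputs above).
df = pd.DataFrame({
    'Date': ['2008-07-01 00:00:00', '2008-08-01 00:00:00',
             '2008-08-01 00:00:00', '2022-08-01 00:00:00'],
    'Tag': ['c#', 'c', 'c#', 'python'],
    'Posts': [3, 83, 506, 22633],
})

# count() returns the number of non-missing entries in each column.
entries_per_column = df.count()
print(entries_per_column)
```

If every column reports the same count as the number of rows, the dataframe has no missing values.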

In [48]:
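# Select Posts explicitly so that only the numeric column is summed (not the Date strings)
sort_by_posts = df.groupby('Tag')[['Posts']].sum().sort_values(by='Posts', ascending=False)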

In [49]:
sort_by_posts

Tag,Posts
javascript,2413191
python,2007757
java,1859995
c#,1554177
php,1442223
c++,773760
r,460858
c,383941
swift,317112
ruby,225326


Let's reset the index so that `Tag` becomes a regular column and the rows are numbered from 0.

In [52]:
sort_by_posts = sort_by_posts.reset_index()

In [53]:
sort_by_posts

Unnamed: 0,Tag,Posts
0,javascript,2413191
1,python,2007757
2,java,1859995
3,c#,1554177
4,php,1442223
5,c++,773760
6,r,460858
7,c,383941
8,swift,317112
9,ruby,225326


We can see that JavaScript is easily the most popular language, as measured by total number of posts.

**Challenge**: Calculate the total number of posts per language.
Which programming language has had the highest total number of posts of all time?

Some languages are older (e.g., C) and other languages are newer (e.g., Swift). The dataset starts in July 2008.

**Challenge**: How many months of data exist per language? Which language had the fewest months with an entry? 
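
A minimal sketch of one way to count months per language, assuming each row of the dataframe is one (month, language) pair. The data here is a toy stand-in built from a few rows shown in the outputs above, so the resulting counts are not the real answer to the challenge.

```python
import pandas as pd

# Toy stand-in for QueryResults.csv (rows copied from the outputs above).
df = pd.DataFrame({
    'Date': ['2008-07-01 00:00:00', '2008-08-01 00:00:00', '2008-08-01 00:00:00',
             '2008-08-01 00:00:00', '2022-08-01 00:00:00', '2022-08-01 00:00:00'],
    'Tag': ['c#', 'c', 'c#', 'c++', 'python', 'swift'],
    'Posts': [3, 83, 506, 164, 22633, 1784],
})

# One row per month and language, so counting rows per Tag
# gives the number of months with an entry for each language.
months_per_language = df.groupby('Tag')['Date'].count().sort_values()
print(months_per_language)
```

Because the result is sorted in ascending order, the languages with the fewest months appear at the top.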


## Data Cleaning

Let's fix the date format to make it more readable. We need to use Pandas to convert the `Date` column from a string such as "2008-07-01 00:00:00" to a datetime object, which displays as "2008-07-01".
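
The conversion could be sketched with `pd.to_datetime()`, as below. The dataframe is a small stand-in built from rows shown earlier, not the full `QueryResults.csv`.

```python
import pandas as pd

# Toy stand-in for QueryResults.csv (rows copied from the outputs above).
df = pd.DataFrame({
    'Date': ['2008-07-01 00:00:00', '2008-08-01 00:00:00', '2022-08-01 00:00:00'],
    'Tag': ['c#', 'c', 'python'],
    'Posts': [3, 83, 22633],
})

# Before the conversion each Date entry is a plain Python string.
print(type(df['Date'][0]))

# Convert the whole column to datetime objects in one step.
df['Date'] = pd.to_datetime(df['Date'])

print(df['Date'].dtype)
print(df.head())
```

Once the column holds datetime values, it can be sorted, filtered, and plotted as a time axis rather than as text.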

## Data Manipulation



**Challenge**: What are the dimensions of our new dataframe? How many rows and columns does it have? Print out the column names and print out the first 5 rows of the dataframe.
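
The notebook does not show how the new dataframe is created. The sketch below assumes it is a pivot of the original data with one row per date and one column per language; the name `reshaped_df` and the choice of `pivot()` are assumptions made for illustration, and the data is a toy stand-in built from rows shown earlier.

```python
import pandas as pd

# Toy stand-in for the cleaned data (rows copied from the outputs above).
df = pd.DataFrame({
    'Date': pd.to_datetime(['2008-07-01', '2008-08-01', '2008-08-01', '2008-08-01']),
    'Tag': ['c#', 'c', 'c#', 'c++'],
    'Posts': [3, 83, 506, 164],
})

# ASSUMPTION: the "new dataframe" has one row per month and one column per language.
# reshaped_df is a hypothetical name.
reshaped_df = df.pivot(index='Date', columns='Tag', values='Posts')

print(reshaped_df.shape)    # (rows, columns)
print(reshaped_df.columns)  # one column per language
print(reshaped_df.head())   # first 5 rows
```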

**Challenge**: Count the number of entries per programming language. Why might the number of entries be different? 
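
Assuming again that the new dataframe is a pivot with one column per language (an assumption, since the notebook does not show it), `count()` gives the number of non-missing entries per language. The toy data below is built from rows shown earlier; `reshaped_df` is a hypothetical name.

```python
import pandas as pd

# Toy stand-in for the cleaned data (rows copied from the outputs above).
df = pd.DataFrame({
    'Date': pd.to_datetime(['2008-07-01', '2008-08-01', '2008-08-01', '2008-08-01']),
    'Tag': ['c#', 'c', 'c#', 'c++'],
    'Posts': [3, 83, 506, 164],
})

# ASSUMPTION: one row per month, one column per language.
reshaped_df = df.pivot(index='Date', columns='Tag', values='Posts')

# Months in which a language has no row in the source data become NaN,
# and count() skips NaN, so the counts can differ between languages.
entries_per_language = reshaped_df.count()
print(entries_per_language)
```

In this toy data only c# has an entry for July 2008, so it has one more entry than the other two languages. If zeros are preferable to missing values, `fillna(0)` is one option.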

## Data Visualisation with Matplotlib


**Challenge**: Use the [matplotlib documentation](https://matplotlib.org/3.2.1/api/_as_gen/matplotlib.pyplot.plot.html#matplotlib.pyplot.plot) to plot a single programming language (e.g., java) on a chart.
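
A minimal sketch of a single-line chart with `plt.plot()`. The dataframe layout (dates as the index, one column per language), the name `reshaped_df`, and the post counts are all assumptions; the numbers are made up for illustration and are not real Stack Overflow data.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up illustrative values, NOT real Stack Overflow counts.
dates = pd.date_range('2020-01-01', periods=6, freq='MS')
reshaped_df = pd.DataFrame({'java': [100, 120, 90, 130, 110, 140]}, index=dates)

plt.figure(figsize=(10, 6))
plt.xlabel('Date')
plt.ylabel('Number of Posts')

# x values come from the datetime index, y values from the java column.
lines = plt.plot(reshaped_df.index, reshaped_df['java'])
```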

**Challenge**: Show two lines (e.g., for Java and Python) on the same chart.
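
One way to draw two lines on the same chart is to call `plt.plot()` once per language and add a legend. As before, the dataframe layout, the name `reshaped_df`, and the values are assumptions made up for illustration.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up illustrative values, NOT real Stack Overflow counts.
dates = pd.date_range('2020-01-01', periods=6, freq='MS')
reshaped_df = pd.DataFrame({
    'java': [100, 120, 90, 130, 110, 140],
    'python': [150, 160, 170, 165, 180, 200],
}, index=dates)

plt.figure(figsize=(10, 6))
plt.xlabel('Date')
plt.ylabel('Number of Posts')

# Each call adds another line to the current axes.
plt.plot(reshaped_df.index, reshaped_df['java'], label='java')
plt.plot(reshaped_df.index, reshaped_df['python'], label='python')
plt.legend()

ax = plt.gca()  # handle to the axes that now holds both lines
```

Passing `label=` to each call is what lets `plt.legend()` tell the two lines apart.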

## Smoothing out Time Series Data

Time series data can be quite noisy, with a lot of up and down spikes. To better see a trend, we can plot an average of, say, 6 or 12 observations. This is called the rolling mean. We calculate the average in a window of time and move it forward by one observation. Pandas has two handy methods already built in to work this out: [rolling()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.rolling.html) and [mean()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.core.window.rolling.Rolling.mean.html).
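
The rolling mean can be sketched on a tiny made-up series as follows. A window of 3 is used here only so the effect is visible on six values; on the real data a window of 6 or 12 would be the more natural choice. The dataframe name and values are assumptions for illustration.

```python
import pandas as pd

# Made-up illustrative values, NOT real Stack Overflow counts.
dates = pd.date_range('2020-01-01', periods=6, freq='MS')
reshaped_df = pd.DataFrame({'java': [10, 20, 30, 40, 50, 60]}, index=dates)

# Average each observation with the two before it (window of 3),
# then slide the window forward by one observation at a time.
roll_df = reshaped_df.rolling(window=3).mean()
print(roll_df)
```

The first two rows of the result are NaN because a full window of 3 observations is not yet available; from the third row onward each value is the mean of the current and the two preceding observations.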