# MDI220: Statistics

# Getting started for the project

This notebook shows how to load and visualize data with Pandas on a simple example.

If you're not familiar with Python, check this set of [exercises](https://www.w3resource.com/python-exercises/).

If you're not familiar with NumPy, Pandas, Matplotlib, or Seaborn, check this set of [notebooks](https://github.com/tbonald/python_data_science) inspired by the [Python Data Science Handbook](https://jakevdp.github.io/PythonDataScienceHandbook/index.html).


## Import

In [None]:
import pandas as pd

## Data

First download the datasets ``power.txt`` and ``temperature.txt`` from eCampus and save them in the working directory.

In [None]:
from os import path

In [None]:
# check datasets
if not all([path.isfile(filename) for filename in ['power.txt', 'temperature.txt']]):
    print('Please download the datasets and save them in the working directory.')
else:
    print("You're ready!")

## Power consumption

Consider the evolution of [electric power consumption](https://archive.ics.uci.edu/ml/datasets/Individual+household+electric+power+consumption) in Sceaux from December 2006 to November 2010.



We focus on the variable ``Global_active_power``.

In [None]:
df = pd.read_csv('power.txt', sep=';', usecols=['Date', 'Time', 'Global_active_power'], low_memory=False)

In [None]:
df.tail()

In [None]:
df = df.rename(columns={'Global_active_power': 'Power'})

## To do

* Remove non-numerical values from the ``Power`` column.
* Use ``to_datetime`` and ``set_index`` to index the DataFrame by dates. 
* Show the evolution of daily average power in the period January-June 2010.


In [None]:
# convert to numbers; entries that cannot be parsed become NaN
df['Power'] = pd.to_numeric(df['Power'], errors='coerce')
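
To see what ``errors='coerce'`` does, here is a minimal sketch on a made-up column. The ``'?'`` placeholder is an assumption about how missing readings are marked; check ``df.tail()`` or the raw file for the actual marker.

```python
import pandas as pd

# Toy column of strings; '?' stands in for a missing reading (assumed marker).
raw = pd.Series(['4.216', '?', '5.360'])

# Unparseable entries are replaced by NaN instead of raising an error.
power = pd.to_numeric(raw, errors='coerce')

# Count the entries that could not be converted.
n_missing = int(power.isna().sum())
print(power)
print(n_missing)
```

The NaN entries can then be removed with ``dropna``, as in the next cell.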

In [None]:
df_clean = df.dropna()

In [None]:
# fraction of valid values
len(df_clean) / len(df)

In [None]:
df = df_clean

In [None]:
df['Datetime'] = df.Date.astype(str) + ' ' + df.Time.astype(str)
df = df.drop(columns=['Date', 'Time'])

In [None]:
df.dtypes

In [None]:
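# parse with an explicit day-first format (assumes dd/mm/yyyy dates, as in the UCI file);
# infer_datetime_format is deprecated since pandas 2.0
df['Datetime'] = pd.to_datetime(df['Datetime'], format='%d/%m/%Y %H:%M:%S')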
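
The order of day and month matters when parsing dates. The sketch below uses two made-up timestamps written day first (an assumption about the file layout, consistent with the UCI description) and parses them with an explicit format string.

```python
import pandas as pd

# Made-up timestamps in day/month/year order (assumed layout).
stamps = pd.Series(['16/12/2006 17:24:00', '1/2/2007 00:00:00'])

# An explicit format removes any ambiguity: '1/2/2007' is 1 February, not 2 January.
parsed = pd.to_datetime(stamps, format='%d/%m/%Y %H:%M:%S')
print(parsed)
```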

In [None]:
df.dtypes

In [None]:
# set index
df = df.set_index('Datetime')

In [None]:
df.head()

In [None]:
# daily average
df_power = df.resample('D').mean()
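
As an illustration of ``resample('D').mean()``, here is a small sketch on synthetic hourly data (the values are invented): each calendar day is reduced to the mean of its readings.

```python
import pandas as pd

# Two days of synthetic hourly readings: 1.0 on the first day, 3.0 on the second.
index = pd.date_range('2010-01-01', periods=48, freq='60min')
toy = pd.DataFrame({'Power': [1.0] * 24 + [3.0] * 24}, index=index)

# Group the rows by calendar day and average within each day.
daily = toy.resample('D').mean()
print(daily)
```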

In [None]:
# rename index
df_power.index.name = 'Date'

In [None]:
# slicing
df_power.loc['2007-01-01':'2007-01-07']

In [None]:
# plot from Jan 2010 to June 2010
df_power.loc['2010-01-01':'2010-06-30'].plot();

## Temperatures

We now consider the evolution of temperatures. 

## To do

* Load data with ``pandas`` and select columns ``DATE`` and ``TG``.
* Rename these columns as ``Date`` and ``Temperature``.
* Divide the column ``Temperature`` by 10 to get temperatures in degrees Celsius. Remove anomalies.
* Build a DataFrame of daily temperatures for the period January-June 2010.
* Display aligned plots with temperatures and power consumption.

In [None]:
df = pd.read_csv('temperature.txt', comment='#')

In [None]:
df.head()

In [None]:
df.columns

In [None]:
# remove spaces
df.columns = df.columns.str.replace(' ', '')

In [None]:
df = df[['DATE', 'TG']]

In [None]:
df = df.rename(columns={'DATE':'Date', 'TG':'Temperature'})

In [None]:
df.dtypes

In [None]:
# convert from tenths of a degree to degrees Celsius (vectorized)
df['Temperature'] = df['Temperature'] / 10

In [None]:
df.head()

In [None]:
df['Temperature'].max()

In [None]:
df['Temperature'].min()

In [None]:
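# drop anomalous values; copy so that later column assignments do not raise a SettingWithCopyWarning
df_clean = df[df['Temperature'] > -100].copy()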
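
The threshold filter can be sketched on a toy table. The sentinel ``-9999`` is an assumption about how missing measurements are encoded; the minimum computed above shows the value actually present in the file.

```python
import pandas as pd

# Toy data in tenths of a degree; -9999 plays the role of a missing value (assumed sentinel).
toy = pd.DataFrame({'Date': [20100101, 20100102, 20100103],
                    'Temperature': [35, -9999, -12]})

# Convert to degrees Celsius, then keep only physically plausible values.
toy['Temperature'] = toy['Temperature'] / 10
valid = toy[toy['Temperature'] > -100].copy()
print(valid)
```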

In [None]:
# fraction of valid values
len(df_clean) / len(df)

In [None]:
df = df_clean

In [None]:
df['Date'] = pd.to_datetime(df['Date'], format="%Y%m%d")

In [None]:
df.head()

In [None]:
# set index
df = df.set_index('Date')

In [None]:
df_temp = df

In [None]:
# plot from Jan 2010 to June 2010
df_temp.loc['2010-01-01':'2010-06-30'].plot();

In [None]:
# merge
df = pd.merge(df_power, df_temp, left_index=True, right_index=True)
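
By default ``pd.merge`` performs an inner join, so only the dates present in both indexes are kept. A minimal sketch with invented values:

```python
import pandas as pd

# Two toy tables indexed by date, with only 2010-01-02 in common.
power = pd.DataFrame({'Power': [1.0, 2.0, 3.0]},
                     index=pd.to_datetime(['2010-01-01', '2010-01-02', '2010-01-03']))
temp = pd.DataFrame({'Temperature': [5.0, 6.0]},
                    index=pd.to_datetime(['2010-01-02', '2010-01-04']))

# Join on the indexes; rows without a match on both sides are dropped.
merged = pd.merge(power, temp, left_index=True, right_index=True)
print(merged)
```

Passing ``how='outer'`` would instead keep all dates and fill the gaps with NaN.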

In [None]:
df.head()

In [None]:
df.dtypes

In [None]:
df.loc['2010-01-01':'2010-06-30'].plot(subplots=True);