### Review of Pandas

This kernel shows what we can do with the Pandas library.
It covers the following subjects:
* Review of Pandas
* Building data frames from scratch
* Visual exploratory data analysis
* Statistical exploratory data analysis
* Indexing Pandas time series
* Resampling Pandas time series

As you will notice, I do not introduce every idea at once. We have already learned some basics of pandas; now we will go deeper. A few reminders:
* single column = Series
* NaN = not a number (marks a missing value)
* dataframe.values = NumPy array
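
The three reminders above can be checked on a small frame. This is a minimal sketch; the column names and values are invented for illustration and do not come from the dataset.

```python
import numpy as np
import pandas as pd

# Toy data, invented for this sketch; None becomes NaN in a numeric column
df = pd.DataFrame({"score": [7.5, None, 6.1], "rank": [1, 2, 3]})

single_column = df["score"]          # selecting one column gives a Series
missing_mask = single_column.isna()  # True where the value is NaN
as_array = df.values                 # the same data as a NumPy array

print(type(single_column).__name__)
print(missing_mask.tolist())
print(type(as_array).__name__, as_array.shape)
```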

In [107]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

import os
print(os.listdir("../input"))

# Any results you write to the current directory are saved as output.

['2017.csv', '2016.csv', '2015.csv']


### BUILDING DATA FRAMES FROM SCRATCH
* We can build data frames from CSV files, as we did earlier.
* We can also build a data frame from a dictionary.
    * zip() function: returns an iterator of tuples, where the i-th tuple contains the i-th element from each of the argument sequences or iterables. Wrap it in list() or dict() to materialize the result.
* Adding a new column
* Broadcasting: create a new column and assign a single value to the entire column
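
Before the full example, here is a minimal sketch of what zip() produces and how the pairs become a data frame. The labels and values are hypothetical and chosen only to keep the output short.

```python
import pandas as pd

# Hypothetical column names and values, used only for this sketch
labels = ["fruit", "count"]
columns = [["apple", "pear"], [3, 5]]

pairs = list(zip(labels, columns))  # zip() is lazy; list() materializes the tuples
frame = pd.DataFrame(dict(pairs))   # dict maps each column name to its list of values

print(pairs)
print(frame.shape)
```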

In [None]:
country = ['Turkey', 'Germany']  # values of the first column
population = ['80000000', '65000000']  # values of the second column (kept as strings here)
list_label = ['country', 'population']  # column names
list_column = [country, population]  # one list of values per column name
zipped = list(zip(list_label, list_column))  # pair each column name with its values
data_dict = dict(zipped)  # turn the (name, values) pairs into a dictionary that pandas accepts
data = pd.DataFrame(data_dict)  # build the data frame from the dictionary
data

Next, I add a new column to the data frame.

In [None]:
data['capital'] = ['Ankara', 'Berlin']
data

What is broadcasting?

Broadcasting means creating a new column and assigning it a single value. Pandas repeats that value for every row, so the whole column is filled at once.

In [None]:
data['test'] = '123'
data
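
To contrast broadcasting with ordinary column assignment, here is a small sketch on an invented frame. A scalar is repeated for every row, while a list has to supply one value per row.

```python
import pandas as pd

frame = pd.DataFrame({"fruit": ["apple", "pear", "plum"]})  # toy data

frame["in_stock"] = True                # one scalar is repeated for every row
frame["price"] = [1.2, 0.8, 2.5]        # a list must match the number of rows
frame["price_x2"] = frame["price"] * 2  # scalars also broadcast in arithmetic

print(frame)
```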

### VISUAL EXPLORATORY DATA ANALYSIS
* Plot
* Subplot
* Histogram:
    * bins: number of bins
    * range(tuple): min and max values of bins
    * density(boolean): normalize or not (called normed in older matplotlib versions)
    * cumulative(boolean): compute cumulative distribution
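
The effect of bins, range and density can be inspected without drawing anything. This sketch uses NumPy's histogram function, which accepts arguments with the same names; the data values are invented.

```python
import numpy as np

values = np.array([1, 2, 2, 3, 3, 3, 9])  # toy data; 9 lies outside the range below

# bins=3 with range=(0, 6) gives the edges 0, 2, 4, 6; values outside the range are ignored
counts, edges = np.histogram(values, bins=3, range=(0, 6))

# density=True rescales the bars so that their total area is 1
density, _ = np.histogram(values, bins=3, range=(0, 6), density=True)
area = float(np.sum(density * np.diff(edges)))

print(counts, edges, area)
```

Note that the normalized bar heights are not forced to lie between 0 and 1; only the total area is fixed.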

In [None]:
data1 = pd.read_csv('../input/2017.csv')  # create a new data frame from the World Happiness Report dataset
data1.head()
data1_cols = data1.columns
data1_cols = data1_cols.str.replace('.', '_', regex=False)  # regex=False: treat '.' as a literal dot, not "any character"
data1.columns = data1_cols
data1.head()

In [None]:
data2 = data1.loc[:,['Happiness_Rank', 'Family', 'Freedom']]
data2.plot()

The combined plot is hard to read because Happiness_Rank is on a much larger scale than Family and Freedom. Therefore I use subplots.

In [None]:
data2.plot(subplots = True, figsize= (12,12))
plt.show()

In [None]:
data2.plot(kind='scatter', x= 'Happiness_Rank', y='Freedom', color = 'Red',grid = True, alpha=0.5, figsize = (12,12))
plt.show()

Now I plot a histogram. I use the density and range parameters this time.
* range limits the x-axis. In this example, I have limited the Happiness Rank values to between 0 and 250.
* density normalizes the y-axis so that the total area of the bars is 1. In this example, I have normalized the frequencies of the Happiness Rank column.

In [None]:
data2.plot(kind='hist', y= 'Happiness_Rank', bins = 50, range=(0,250), density=True)
plt.show()

In [None]:
fig,axes = plt.subplots(nrows=2 , ncols=1)
data2.plot(kind='hist', y='Freedom', bins = 50, range=(0,1), density=True, ax= axes[0])
data2.plot(kind='hist', y='Happiness_Rank', bins= 50, range=(0,250), density=True, ax=axes[1], cumulative = True)
plt.savefig('graph.png')
plt.show()

The second plot above is an example of how to use the 'cumulative' parameter. This parameter works like a running total: for each bin, it adds up the values of all bins before it plus the bin itself, and shows the result.
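
The running total can be sketched with NumPy's cumsum. The bar heights below are invented; they are not taken from the plot.

```python
import numpy as np

counts = np.array([2, 5, 3])       # toy bar heights, invented for this sketch
running_total = np.cumsum(counts)  # each entry adds all bars up to and including it

# Combined with density=True, the running total is taken over fractions, so it ends at 1
fractions = counts / counts.sum()
running_fraction = np.cumsum(fractions)

print(running_total)
print(running_fraction)
```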

### STATISTICAL EXPLORATORY DATA ANALYSIS
I already explained this in previous parts. However, let's look at it one more time.
* count: number of entries
* mean: average of entries
* std: standard deviation
* min: minimum entry
* 25%: first quartile
* 50%: median or second quartile
* 75%: third quartile
* max: maximum entry
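
On a series small enough to check by hand, the same statistics look like this. The numbers are toy data; note that pandas reports the sample standard deviation (dividing by n - 1).

```python
import pandas as pd

numbers = pd.Series([1, 2, 3, 4, 5])  # toy data
summary = numbers.describe()          # count, mean, std, min, quartiles, max

print(summary)
```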

In [None]:
data1.describe()

### INDEXING PANDAS TIME SERIES
* datetime = object
* parse_dates(boolean): a pd.read_csv parameter that parses date columns into datetime objects, displayed in ISO 8601 format (yyyy-mm-dd hh:mm:ss)
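
A minimal sketch of parse_dates, assuming a CSV with a column named date. The in-memory text below stands in for a real file, and its contents are invented. Besides a boolean, parse_dates also accepts a list of column names, which is the form used here.

```python
import io
import pandas as pd

# A tiny in-memory CSV standing in for a real file (contents invented)
csv_text = "date,value\n1995-04-23,10\n1999-11-14,20\n"

as_text = pd.read_csv(io.StringIO(csv_text))  # dates stay plain strings
as_dates = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"], index_col="date")

print(type(as_text["date"].iloc[0]).__name__)
print(type(as_dates.index).__name__)
```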

In [None]:
time_list = ["1995-04-23", "1999-11-14", "1989-1-17", "1996-6-5", "1999-11-4"]
print(type(time_list[1]))  # the dates are plain strings
# let's convert them to datetime objects
data3 = data2.head().copy()  # copy so that adding a column does not trigger SettingWithCopyWarning
datetime_object = pd.to_datetime(time_list)  # convert the strings to datetime
data3['date'] = datetime_object  # add to the data frame as a column
data3

Let's replace the default index with a new index built from the date column.

In [None]:
data3 = data3.set_index('date')
data3

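Now I select rows using the time index. The index is sorted first, because slicing by a date range is only reliable on a sorted index.

In [None]:
data3 = data3.sort_index()  # put the dates in chronological order
print(data3.loc['1999-11-14'])  # a single date
print(data3.loc['1989-01-17':'1999-11-04'])  # a date range, both ends included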
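
A datetime index also accepts partial date strings such as a year. The sketch below uses an invented frame with a sorted datetime index; the column name value is hypothetical.

```python
import pandas as pd

dates = pd.to_datetime(["1995-04-23", "1999-11-14", "1989-01-17", "1999-11-04"])
frame = pd.DataFrame({"value": [1, 2, 3, 4]}, index=dates).sort_index()  # toy data

whole_year = frame.loc["1999"]                     # partial string: every row in 1999
date_range = frame.loc["1995-01-01":"1999-11-04"]  # both ends are included

print(whole_year)
print(date_range)
```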

### RESAMPLING PANDAS TIME SERIES
* Resampling: applying a statistical method over different time intervals
    * Needs a string to specify the frequency, like "M" = month or "A" = year (pandas 2.2 and later prefer "ME" and "YE")
* Downsampling: reduce datetime rows to a slower frequency, like from daily to weekly
* Upsampling: increase datetime rows to a faster frequency, like from daily to hourly
* Interpolate: fill in missing values according to different methods like 'linear', 'time' or 'index'
    * https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.interpolate.html
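
The terms above can be tried on a small series. This is a sketch on invented data: fourteen daily values are downsampled to weekly means, and the weekly result is then upsampled back to daily rows with the gaps filled by linear interpolation. The "W" and "D" aliases are used here instead of the monthly and yearly ones.

```python
import pandas as pd

# Fourteen days of toy data: the values 0..13, one per day, starting on a Monday
days = pd.date_range("2021-01-04", periods=14, freq="D")
daily = pd.Series(range(14), index=days, dtype="float64")

# Downsampling: daily -> weekly ("W" labels each week by its Sunday)
weekly = daily.resample("W").mean()

# Upsampling: weekly -> daily leaves NaN gaps, which interpolate() fills linearly
back_to_daily = weekly.resample("D").asfreq().interpolate("linear")

print(weekly)
print(back_to_daily)
```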

In [None]:
data3.resample('A').mean()

Here, I resample by year (using 'A'). The result shows the yearly mean of the values in the data frame, with NaN for years that have no rows.

In [None]:
data3.resample('M').mean()

Here, I resample by month (using 'M'). The result shows the monthly mean, with NaN for months that have no rows.

In [None]:
data3.resample('M').first().interpolate('linear')

The interpolate method fills each NaN with a value on the straight line between the nearest known values before and after it.
Here is another example of resampling:

In [None]:
data3.resample("M").mean().interpolate("linear")
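
The 'linear' method treats rows as equally spaced, which is fine after resampling to a regular frequency. On an irregular index the 'time' method can give a different answer. A minimal sketch with invented dates and values:

```python
import numpy as np
import pandas as pd

# Toy series with unevenly spaced dates: the gap after the NaN is twice as long
dates = pd.to_datetime(["2021-01-01", "2021-01-02", "2021-01-04"])
gappy = pd.Series([0.0, np.nan, 3.0], index=dates)

linear_fill = gappy.interpolate("linear")  # ignores the index, rows count as equally spaced
time_fill = gappy.interpolate("time")      # weights by the actual time gaps

print(linear_fill.iloc[1], time_fill.iloc[1])
```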