# A Jupyter notebook for learning pandas

In [2]:
import pandas as pd
import numpy as np

## Creation of DataFrames
The DataFrame is essentially a two-dimensional object, and it can be created in three different ways:

- out of a two dimensional NumPy array
- out of given columns
- out of given rows

### Creation of DataFrames from a NumPy array

In the following example a DataFrame with 2 rows and 3 columns is created. The row and column indices are given explicitly.

In [3]:
df=pd.DataFrame(np.random.randn(2,3), columns=["First", "Second", "Third"], index=["a", "b"])
df

Unnamed: 0,First,Second,Third
a,-0.785463,-0.041614,1.001909
b,1.004154,0.092436,1.598727


You can use <name>.index and <name>.columns to find the row labels (the index) and the column names.
- If you leave out the column or index names, pandas fills them in with integers starting from 0.

In [4]:
df.index

Index(['a', 'b'], dtype='object')

In [5]:
df.columns

Index(['First', 'Second', 'Third'], dtype='object')
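
The default labelling can be checked with a small sketch. The name `df_default` and the all-zero data are made up for this illustration and are not part of the notebook above.

```python
import numpy as np
import pandas as pd

# No columns= or index= is given, so pandas supplies default integer labels.
df_default = pd.DataFrame(np.zeros((2, 3)))

print(list(df_default.index))    # [0, 1]
print(list(df_default.columns))  # [0, 1, 2]
```

Both label sets start counting from 0, just like positions in a Python list.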

### Creating DataFrames from rows
We can give a list of rows as a parameter to the DataFrame constructor. Each row is given as a dict, list, Series, or NumPy array. 

For example:

In [6]:
df=pd.DataFrame([{"Wage" : 1000, "Name" : "Jack", "Age" : 21}, {"Wage" : 1500, "Name" : "John", "Age" : 29}])
df

Unnamed: 0,Wage,Name,Age
0,1000,Jack,21
1,1500,John,29
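
As a rough sketch of two other routes mentioned earlier, the same table can be built from rows given as plain lists, or from columns given as a dict. The names `from_rows` and `from_columns` are chosen only for this example.

```python
import pandas as pd

# Rows given as plain lists need the column names passed separately.
from_rows = pd.DataFrame([[1000, "Jack", 21], [1500, "John", 29]],
                         columns=["Wage", "Name", "Age"])

# Columns given as a dict that maps each column name to its values.
from_columns = pd.DataFrame({"Wage": [1000, 1500],
                             "Name": ["Jack", "John"],
                             "Age": [21, 29]})

# Both routes should give the same DataFrame.
print(from_rows.equals(from_columns))
```

Which route is more convenient probably depends on how the data arrives: record by record, or one field at a time.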


## Accessing columns and rows of a DataFrame
When you index a DataFrame with square brackets, you need to give the name of a column. A bare integer such as df[0] is treated as a column label, so it raises a KeyError unless a column with that label exists. The same goes for fancy indexing, where you select multiple items with a list of names.
For example:

In [7]:
try:
    df[0]
except KeyError:
    import sys
    print("Key error", file=sys.stderr)

Key error


In [8]:
df['Wage']

0    1000
1    1500
Name: Wage, dtype: int64

In [9]:
df[['Wage', 'Name']]

Unnamed: 0,Wage,Name
0,1000,Jack
1,1500,John


You can, however, select rows with an integer slice, which works by position and excludes the end, or with a boolean mask.

In [10]:
df[0:1]

Unnamed: 0,Wage,Name,Age
0,1000,Jack,21


In [11]:
df[df.Wage > 1200]

Unnamed: 0,Wage,Name,Age
1,1500,John,29
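
The sketch below tries the same two selections on a frame with string row labels, and combines two conditions in one mask. The frame `people`, its labels, and the extra row for "Jill" are hypothetical and only serve the illustration.

```python
import pandas as pd

people = pd.DataFrame({"Wage": [1000, 1500, 1200],
                       "Name": ["Jack", "John", "Jill"],
                       "Age": [21, 29, 35]},
                      index=["a", "b", "c"])

# An integer slice inside [] selects rows by position; the end is excluded.
first_two = people[0:2]
print(list(first_two.index))   # ['a', 'b']

# Combine conditions with & (and) or | (or); each condition needs parentheses.
mask = (people["Wage"] > 1100) & (people["Age"] < 30)
selected = people[mask]
print(list(selected["Name"]))  # ['John']
```

The parentheses matter because `&` binds more tightly than the comparison operators in Python.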


## Alternative indexing and data selection

There is another way to index Pandas DataFrames, which
- allows use of index pairs to access a single element
- has the same order of dimensions as NumPy: first index specifies rows, second columns
- is not ambiguous about implicit or explicit indices

This is done through the loc and iloc attributes.

The difference between the loc and iloc attributes is that the former uses explicit indices (labels) and the latter uses the implicit integer indices (positions).
For example:

In [12]:
df.loc[1, "Wage"]

1500

In [13]:
df.iloc[-1,-1]             # Right lower corner of the DataFrame

29

In [14]:
df.loc[1, ["Name", "Wage"]]

Name    John
Wage    1500
Name: 1, dtype: object
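
The difference becomes easier to see when the row labels are not the same as the positions. The following is a minimal sketch; the frame `staff` and its labels 10, 20, 30 are assumptions made for the example.

```python
import pandas as pd

# Hypothetical frame whose row labels (10, 20, 30) differ from the positions (0, 1, 2).
staff = pd.DataFrame({"Wage": [1000, 1500, 1200]}, index=[10, 20, 30])

by_label = staff.loc[10, "Wage"]   # the row labelled 10
by_position = staff.iloc[0, 0]     # the first row, first column
print(by_label, by_position)       # 1000 1000

# Slices differ too: loc includes the end label, iloc excludes the end position.
print(len(staff.loc[10:20]))   # 2
print(len(staff.iloc[0:1]))    # 1
```

Here `staff.loc[0, "Wage"]` would raise a KeyError, because no row carries the label 0.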

## Summary statistics
The summary statistic methods work in a similar way as their counterparts in NumPy. By default, the aggregation is done over columns.

In [15]:
wh = pd.read_csv("https://raw.githubusercontent.com/csmastersUH/data_analysis_with_python_2020/master/kumpula-weather-2017.csv")

wh2 = wh.drop(["Year", "m", "d"], axis=1)  # taking averages over these is not very interesting
wh2.mean(numeric_only=True)  # skip the text columns (Time, Time zone)

Precipitation amount (mm)    1.966301
Snow depth (cm)              0.966480
Air temperature (degC)       6.527123
dtype: float64
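
To illustrate what aggregating over columns means, here is a small sketch that also passes the `axis` parameter to aggregate over rows instead. The frame `scores` and its values are invented for the example.

```python
import pandas as pd

scores = pd.DataFrame({"x": [1.0, 3.0], "y": [2.0, 6.0]})

col_means = scores.mean()        # one value per column: x 2.0, y 4.0
row_means = scores.mean(axis=1)  # one value per row: 1.5, 4.5

print(col_means)
print(row_means)
```

Other aggregations such as `sum`, `min`, and `max` accept the same `axis` parameter.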

The describe method gives several summary statistics for each numeric column. The result is a DataFrame.

In [16]:
wh.describe()

Unnamed: 0,Year,m,d,Precipitation amount (mm),Snow depth (cm),Air temperature (degC)
count,365.0,365.0,365.0,365.0,358.0,365.0
mean,2017.0,6.526027,15.720548,1.966301,0.96648,6.527123
std,0.0,3.452584,8.808321,4.858423,3.717472,7.183934
min,2017.0,1.0,1.0,-1.0,-1.0,-17.8
25%,2017.0,4.0,8.0,-1.0,-1.0,1.2
50%,2017.0,7.0,16.0,0.2,-1.0,4.8
75%,2017.0,10.0,23.0,2.7,0.0,12.9
max,2017.0,12.0,31.0,35.0,15.0,19.6


## Missing data 

We can see that Snow depth has only 358 entries, yet there are 365 days in a year. Why?
This is because the Snow depth column likely has some NaN entries. In pandas, NaN can be used to represent a missing value. In the weather DataFrame the NaN value tells us that the measurement from that day is not available, possibly due to a broken measuring instrument or some other problem.

We can find the NaN values by using the isnull() method, which returns a boolean DataFrame that is True wherever a value is missing.
(The notnull() method works conversely to the isnull() method.)

In [17]:
wh.isnull()

Unnamed: 0,Year,m,d,Time,Time zone,Precipitation amount (mm),Snow depth (cm),Air temperature (degC)
0,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...
360,False,False,False,False,False,False,False,False
361,False,False,False,False,False,False,False,False
362,False,False,False,False,False,False,False,False
363,False,False,False,False,False,False,False,False


We can then combine this with the any() method to find all rows that contain at least one missing value.

In [18]:
wh[wh.isnull().any(axis=1)]

Unnamed: 0,Year,m,d,Time,Time zone,Precipitation amount (mm),Snow depth (cm),Air temperature (degC)
74,2017,3,16,00:00,UTC,1.8,,3.4
163,2017,6,13,00:00,UTC,0.6,,12.6
308,2017,11,5,00:00,UTC,0.2,,8.4
309,2017,11,6,00:00,UTC,2.0,,7.5
313,2017,11,10,00:00,UTC,3.6,,7.2
321,2017,11,18,00:00,UTC,11.3,,5.9
328,2017,11,25,00:00,UTC,8.5,,4.2
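
A related trick is to count the missing values instead of listing them. The sketch below uses a tiny made-up frame named `sample` rather than the weather data, so the numbers are not taken from the notebook.

```python
import numpy as np
import pandas as pd

sample = pd.DataFrame({"depth": [1.0, np.nan, np.nan],
                       "temp": [3.4, 12.6, np.nan]})

# True counts as 1 when summed, so this gives the missing values per column.
missing_per_column = sample.isnull().sum()
print(missing_per_column)

# Number of rows that contain at least one missing value.
rows_with_missing = sample.isnull().any(axis=1).sum()
print(rows_with_missing)  # 2
```

Applied to the weather DataFrame, the same calls should show which columns account for the incomplete rows.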


Finally, we can use the dropna() method to drop the rows or columns containing missing values. With axis=1, it drops the columns.

In [19]:
wh.dropna(axis=1).shape

(365, 7)
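
The effect of the `axis` parameter can be seen on a small frame. This is only a sketch; `sample` and its values are made up for the example.

```python
import numpy as np
import pandas as pd

sample = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [4.0, 5.0, 6.0]})

rows_dropped = sample.dropna()        # default axis=0: drop rows with NaN
cols_dropped = sample.dropna(axis=1)  # axis=1: drop columns with NaN

print(rows_dropped.shape)  # (2, 2)
print(cols_dropped.shape)  # (3, 1)
```

Dropping rows keeps every column but loses observations, while dropping columns keeps every observation but loses a whole variable.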

Alternatively, we can use the fillna() method to replace the missing values.

The method parameter can be:

- None: use the given positional parameter as the constant to fill missing values with
- ffill: use the previous value to fill the current value
- bfill: use the next value to fill the current value

For example:

In [20]:
wh = wh.ffill()  # same effect as fillna(method='ffill'), which newer pandas versions deprecate
wh[wh.isnull().any(axis=1)]

Unnamed: 0,Year,m,d,Time,Time zone,Precipitation amount (mm),Snow depth (cm),Air temperature (degC)
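
The three filling strategies can be compared side by side on a short Series. The Series `s` is invented for this sketch, and it uses the `ffill()` and `bfill()` methods, which fill forward and backward in the same way as the corresponding `method` values.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])

constant_filled = s.fillna(0.0)  # replace NaN with a constant
forward_filled = s.ffill()       # copy the previous value forward
backward_filled = s.bfill()      # copy the next value backward

print(constant_filled.tolist())  # [1.0, 0.0, 3.0]
print(forward_filled.tolist())   # [1.0, 1.0, 3.0]
print(backward_filled.tolist())  # [1.0, 3.0, 3.0]
```

Note that a forward fill cannot fill a missing value at the very start of the data, because there is no previous value to copy.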


The interpolate method, which we will not cover here, offers more elaborate ways to interpolate the missing values from their neighbouring non-missing values.