# A Jupyter notebook for learning pandas

In [10]:
import pandas as pd
import numpy as np

## Creation of DataFrames
The DataFrame is essentially a two-dimensional object, and it can be created in three different ways:

- out of a two dimensional NumPy array
- out of given columns
- out of given rows
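
The three routes listed above can be sketched side by side. The column names `x` and `y` and the values are made up purely for illustration; the point is only that each route can describe the same table.

```python
import numpy as np
import pandas as pd

# 1. From a two-dimensional NumPy array (column names are supplied separately).
from_array = pd.DataFrame(np.array([[1, 2], [3, 4]]), columns=["x", "y"])

# 2. From columns: a dict that maps each column name to its values.
from_columns = pd.DataFrame({"x": [1, 3], "y": [2, 4]})

# 3. From rows: a list in which every element describes one row.
from_rows = pd.DataFrame([{"x": 1, "y": 2}, {"x": 3, "y": 4}])

# All three hold the same values in the same layout.
print(from_array.values.tolist())
print(from_columns.values.tolist())
print(from_rows.values.tolist())
```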

### Creation of DataFrames from a NumPy array

In the following example a DataFrame with 2 rows and 3 columns is created. The row and column indices are given explicitly.

In [11]:
df = pd.DataFrame(np.random.randn(2, 3), columns=["First", "Second", "Third"], index=["a", "b"])
df

       First     Second      Third
a   -0.85923  -0.001946  -1.474767
b  -0.902837  -0.222354  -0.911961


You can use df.index and df.columns to find the row labels (the index) and the column names.
- If you leave out the columns or index argument, pandas fills in integer labels starting from 0.
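
A minimal sketch of the default labels, assuming a small array of zeros as stand-in data. The variable name `unlabeled` is only illustrative.

```python
import numpy as np
import pandas as pd

# No index or columns arguments: pandas supplies integer labels 0, 1, 2, ...
unlabeled = pd.DataFrame(np.zeros((2, 3)))

print(list(unlabeled.index))    # [0, 1]
print(list(unlabeled.columns))  # [0, 1, 2]
```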

In [12]:
df.index

Index(['a', 'b'], dtype='object')

In [13]:
df.columns

Index(['First', 'Second', 'Third'], dtype='object')

### Creation of DataFrames from rows
We can give a list of rows as a parameter to the DataFrame constructor. Each row is given as a dict, list, Series, or NumPy array. 

For example:

In [14]:
df = pd.DataFrame([{"Wage": 1000, "Name": "Jack", "Age": 21}, {"Wage": 1500, "Name": "John", "Age": 29}])
df

   Wage  Name  Age
0  1000  Jack   21
1  1500  John   29


## Accessing columns and rows of a DataFrame
When you index a DataFrame with square brackets, you need to specify the name of the column. You cannot just index using an integer position: df[0] looks for a column labelled 0 and fails if there is none. The same goes for fancy indexing, where you pass a list of names to select multiple columns.
For example:

In [15]:
import sys

try:
    df[0]  # there is no column labelled 0
except KeyError:
    print("Key error", file=sys.stderr)

Key error


In [17]:
df['Wage']

0    1000
1    1500
Name: Wage, dtype: int64

In [18]:
df[['Wage', 'Name']]

   Wage  Name
0  1000  Jack
1  1500  John


In pandas you can, however, select rows with square brackets in two ways: an integer slice such as df[0:1] selects rows by position, and a boolean mask selects the rows where the mask is True.

In [30]:
df[0:1]

   Wage  Name  Age
0  1000  Jack   21


In [32]:
df[df.Wage > 1200]

   Wage  Name  Age
1  1500  John   29
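
A minimal sketch of the same two row selections on a DataFrame whose rows carry string labels. The frame `people` and its values are assumptions for illustration, not part of the notebook's data.

```python
import pandas as pd

people = pd.DataFrame(
    {"Wage": [1000, 1500, 1200], "Name": ["Jack", "John", "Ann"]},
    index=["a", "b", "c"],
)

# An integer slice inside square brackets picks rows by position.
first_two = people[0:2]

# A boolean mask keeps the rows where the condition is True.
# Conditions are combined with & and |, each wrapped in parentheses.
mid_wage = people[(people["Wage"] > 1000) & (people["Wage"] < 1500)]

print(list(first_two.index))   # ['a', 'b']
print(list(mid_wage["Name"]))  # ['Ann']
```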


## Alternative indexing and data selection

There is another way to index pandas DataFrames, through the loc and iloc attributes. This form of indexing
- allows use of index pairs to access a single element
- has the same order of dimensions as NumPy: first index specifies rows, second columns
- is not ambiguous about implicit or explicit indices

The difference between the loc and iloc attributes is that the former uses the explicit indices (the labels) and the latter uses the implicit integer indices (the positions).
For example:

In [33]:
df.loc[1, "Wage"]

1500

In [34]:
df.iloc[-1, -1]  # Lower right corner of the DataFrame

29

In [35]:
df.loc[1, ["Name", "Wage"]]

Name    John
Wage    1500
Name: 1, dtype: object
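
In the examples above the row labels happen to equal the row positions, so loc and iloc look alike. The sketch below uses hypothetical row labels 10 and 20 (an assumption made only for illustration) so that the two attributes give visibly different access.

```python
import pandas as pd

# Hypothetical row labels 10 and 20, chosen so labels and positions differ.
staff = pd.DataFrame({"Wage": [1000, 1500], "Name": ["Jack", "John"]}, index=[10, 20])

print(staff.loc[10, "Wage"])  # label 10, column "Wage" -> 1000
print(staff.iloc[1, 0])       # second row, first column -> 1500

# loc slices include the end label; iloc slices exclude the end position.
print(len(staff.loc[10:20]))  # 2
print(len(staff.iloc[0:1]))   # 1
```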

## Summary statistics
The summary statistic methods work in a similar way as their counterparts in NumPy. By default, the aggregation is done over columns, so each column yields one value.

In [36]:
wh = pd.read_csv("https://raw.githubusercontent.com/csmastersUH/data_analysis_with_python_2020/master/kumpula-weather-2017.csv")

wh2 = wh.drop(["Year", "m", "d"], axis=1)  # taking averages over these is not very interesting
wh2.mean()


Precipitation amount (mm)    1.966301
Snow depth (cm)              0.966480
Air temperature (degC)       6.527123
dtype: float64
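
The direction of aggregation can be sketched on a tiny frame. The data below is made up for illustration. The last lines show one difference from plain NumPy that is easy to miss: pandas skips missing values by default.

```python
import pandas as pd

scores = pd.DataFrame({"a": [1.0, 3.0], "b": [2.0, 6.0]})

col_means = scores.mean()        # default axis=0: one mean per column
row_means = scores.mean(axis=1)  # axis=1: one mean per row

print(col_means.tolist())  # [2.0, 4.0]
print(row_means.tolist())  # [1.5, 4.5]

# Missing values (NaN) are skipped by default (skipna=True).
with_gap = pd.Series([1.0, float("nan"), 3.0])
print(with_gap.mean())     # 2.0
```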

The describe method gives several summary statistics for each numeric column. The result is a DataFrame.

In [37]:
wh.describe()

         Year         m          d  Precipitation amount (mm)  Snow depth (cm)  Air temperature (degC)
count   365.0     365.0      365.0                      365.0            358.0                   365.0
mean   2017.0  6.526027  15.720548                   1.966301          0.96648                6.527123
std       0.0  3.452584   8.808321                   4.858423         3.717472                7.183934
min    2017.0       1.0        1.0                       -1.0             -1.0                   -17.8
25%    2017.0       4.0        8.0                       -1.0             -1.0                     1.2
50%    2017.0       7.0       16.0                        0.2             -1.0                     4.8
75%    2017.0      10.0       23.0                        2.7              0.0                    12.9
max    2017.0      12.0       31.0                       35.0             15.0                    19.6
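
Because the result of describe is itself a DataFrame, single statistics can be picked out with loc. A minimal sketch on made-up data (the frame `temps` and its column `t` are assumptions for illustration):

```python
import pandas as pd

temps = pd.DataFrame({"t": [1.0, 2.0, 3.0, 4.0]})
summary = temps.describe()

print(type(summary).__name__)    # DataFrame
print(list(summary.index))       # row labels name the statistics
print(summary.loc["mean", "t"])  # 2.5
print(summary.loc["50%", "t"])   # 2.5 (the median)
```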
