Import necessary libraries

In [1]:
import numpy as np
import pandas as pd

# Introduction to data structure

> All pandas data structures are **value-mutable** (the values they hold can be changed). All except Series are also **size-mutable**; a Series is **size-immutable**.

| Data | Dimension | Description |
|------|-----------|-------------|
| Series | 1 | 1D labeled homogeneous array, size-immutable. |
| DataFrame | 2 | General 2D labeled, size-mutable tabular structure with potentially heterogeneously typed columns. |
| Panel | 3 | General 3D labeled, size-mutable array (deprecated in pandas 0.20 and removed in 0.25). |
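
To make the two kinds of mutability concrete, here is a minimal sketch. The variable names and sample values are illustrative assumptions, not part of the original notebook. It overwrites a value in a Series and adds a column to a DataFrame.

```python
import pandas as pd

# Value mutability: an existing element can be overwritten in place.
s = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s['b'] = 20

# Size mutability of a DataFrame: assigning to a new label adds a column.
df = pd.DataFrame({'x': [1, 2, 3]})
df['y'] = df['x'] * 2

print(s)
print(df.shape)
```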


## Series

A Series is a one-dimensional labeled array capable of holding data of any type (integer, string, float, Python objects, etc.). The axis labels are collectively called the index.

![](http://drive.google.com/uc?export=view&id=1QN-HVLOCMp4NdvlVCpGS8KALbeN3MQxA)

Series Basic Functionality

|Attribute & Method | Description |
|-----|---------|
|`axes`|Returns a list of the row axis labels|
|`dtype`|Returns the dtype of the object|
|`empty`|Returns True if series is empty|
|`ndim`|Returns the number of dimensions of the underlying data, by definition 1|
|`size`|Returns the number of elements in the underlying data|
|`values`|Returns the Series as ndarray|
|`head()`|Returns the first n rows.|
|`tail()`|Returns the last n rows.|
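
The attributes and methods in the table can be tried on a small Series. This is only a sketch; the sample values and labels are made up for illustration.

```python
import pandas as pd

s = pd.Series([1.5, 2.5, 3.5, 4.5], index=['a', 'b', 'c', 'd'])

print(s.axes)      # list holding the single row index
print(s.dtype)     # float64
print(s.empty)     # False
print(s.ndim)      # 1
print(s.size)      # 4
print(s.values)    # underlying NumPy array
print(s.head(2))   # first two rows
print(s.tail(2))   # last two rows
```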




### Create/Initialize Series

**Creation**
`pandas.Series( data, index, dtype, copy)`
![](http://drive.google.com/uc?export=view&id=1GtJ_I294vI4_TJxNTDSylyhRx2OgV9EZ)

#### Create an Empty Series
The simplest Series that can be created is an empty one.

In [2]:
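s = pd.Series(dtype=float)  # pass dtype explicitly; the default dtype of an empty Series varies across pandas versions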
print(s)

Series([], dtype: float64)



#### Create a Series from ndarray
If data is an ndarray, any index passed **must** have the same length as the array.
+ If no index is passed, the default index is `range(n)`, where `n` is the array length.
+ Each element of the input array becomes one value of the Series.


In [3]:
data = np.array(['a','b','c','d'])
s = pd.Series(data)
print(s)

0    a
1    b
2    c
3    d
dtype: object


In [4]:
data = np.array(['a','b','c','d'])
s = pd.Series(data,index=[100,101,102,103])
print(s)

100    a
101    b
102    c
103    d
dtype: object


In [5]:
data = np.array(['a','b','c','d'])
s = pd.Series(data,index=np.arange(data.shape[0]))
print(s)

0    a
1    b
2    c
3    d
dtype: object
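
The length requirement stated at the start of this section can be checked directly. The sketch below assumes a deliberately short index (three labels for four values) and catches the resulting `ValueError`. The exact error text differs between pandas versions, so it is printed rather than relied upon.

```python
import numpy as np
import pandas as pd

data = np.array(['a', 'b', 'c', 'd'])
message = None

try:
    # Four values but only three labels: pandas rejects the mismatch.
    pd.Series(data, index=[100, 101, 102])
except ValueError as err:
    message = str(err)
    print("ValueError:", message)
```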


#### Create a Series from dict
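A dict can be passed as input.
+ If no index is specified, the dictionary keys become the index. Recent pandas versions (0.23 and later, on Python 3.6+) keep the keys in insertion order; older versions sorted them.
+ If an index is passed, the values in data corresponding to the labels in the index are pulled out; labels with no matching key get `NaN`.

*Note*: Each key-value pair in the dictionary becomes one row of the Series.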

In [6]:
data = {'a' : 0., 'b' : 1., 'c' : 2.}
s = pd.Series(data)
print(s)

a    0.0
b    1.0
c    2.0
dtype: float64


In [7]:
data = {'a' : 0., 'b' : 1., 'c' : 2.}
s = pd.Series(data,index=['b','c','d','a'])
print(s)

b    1.0
c    2.0
d    NaN
a    0.0
dtype: float64


#### Create a Series from Scalar
If data is a scalar value, an index must be provided. The value is repeated to match the length of the index.

In [8]:
s = pd.Series(5, index=[0, 1, 2, 3])
print(s)

0    5
1    5
2    5
3    5
dtype: int64


### Data Access

#### Access Series with position
Data in a Series can be accessed by position, much as in an ndarray.

In [9]:
s = pd.Series([1,2,3,4,5],index = ['a','b','c','d','e'])
print(s.iloc[0])      # retrieve the first element by position (plain s[0] is deprecated when the index holds labels)

1


In [10]:
s = pd.Series([1,2,3,4,5],index = ['a','b','c','d','e'])
print(s[:3])        # retrieve the first three elements

a    1
b    2
c    3
dtype: int64


In [11]:
s = pd.Series([1,2,3,4,5],index = ['a','b','c','d','e'])
print(s[-3:])       # retrieve the last three elements

c    3
d    4
e    5
dtype: int64
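
Positional access becomes ambiguous when the index itself holds integers. A minimal sketch, assuming a made-up Series whose integer labels run in reverse order, shows how `.iloc` (position) and `.loc` (label) keep the two meanings apart.

```python
import pandas as pd

# Integer labels deliberately stored in reverse order
s = pd.Series([10, 20, 30], index=[2, 1, 0])

first_by_position = s.iloc[0]   # position 0 -> 10
value_at_label_0 = s.loc[0]     # label 0 -> 30

print(first_by_position, value_at_label_0)
print(s.iloc[-2:])              # last two elements by position
```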


#### Access Series with Label/Index
A Series is like a fixed-size dict in that you can get and set values by index label.

In [12]:
s = pd.Series([1,2,3,4,5],index = ['a','b','c','d','e'])
print(s['a'])        #retrieve a single element

1


In [13]:
s = pd.Series([1,2,3,4,5],index = ['a','b','c','d','e'])
print(s[['a','c','d']])  #retrieve multiple elements

a    1
c    3
d    4
dtype: int64
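
The cells above only read values. Since a Series behaves like a dict keyed by label, values can also be set, and labels can be tested for membership. The following sketch uses illustrative values; `'z'` is a label chosen because it does not exist.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])

s['a'] = 100                  # overwrite the value stored under label 'a'
has_c = 'c' in s              # membership tests the index labels, like dict keys
missing = s.get('z', -1)      # .get returns a default instead of raising KeyError

print(s['a'], has_c, missing)
```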


## DataFrame
A DataFrame is a two-dimensional data structure:
+ Columns can be of different types
+ Size-mutable
+ Labeled axes (rows and columns)
+ Arithmetic operations can be performed on rows and columns

![](http://drive.google.com/uc?export=view&id=14xOa0qU6Q4LJWNanj0DZbEhfnYp67TGb "You can think of it as an SQL table or a spreadsheet data representation.")

DataFrame Basic Functionality

|Attribute & Method | Description |
|-----|---------|
|`T`|Transposes rows and columns|
|`axes`|Returns a list with the row axis labels and column axis labels as the only members|
|`dtypes`|Returns the dtypes in this object|
|`empty`|True if the DataFrame is entirely empty (no items), that is, if any of the axes has length 0|
|`ndim`|Number of axes / array dimensions|
|`shape`|Returns a tuple representing the dimensionality of the DataFrame|
|`size`|Number of elements in the NDFrame.|
|`values`|Numpy representation of NDFrame|
|`head()`|Returns the first n rows|
|`tail()`|Returns last n rows|
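
As a quick illustration of the table, the sketch below applies each attribute and method to a small, made-up DataFrame.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Tom', 'Jack', 'Steve'], 'Age': [28, 34, 29]})

print(df.T)        # rows and columns swapped
print(df.axes)     # [row index, column index]
print(df.dtypes)   # one dtype per column
print(df.empty)    # False
print(df.ndim)     # 2
print(df.shape)    # (3, 2)
print(df.size)     # 6 = 3 rows * 2 columns
print(df.head(2))  # first two rows
print(df.tail(1))  # last row
```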





### Create/Initialize DataFrame
`pandas.DataFrame( data, index, columns, dtype, copy)`

![](http://drive.google.com/uc?export=view&id=1BQfWbdcpqT_w0K5fLn1oT4K8oO4e1hcR)

-----------------------------------

How a pandas DataFrame is initialized depends on the input:
+ `list` of values: each element is a record (row)
+ `dict` of ndarrays/lists: each key-value pair is a column
+ `list` of `dict`: each dictionary is a record (row), and its keys become column names
+ `dict` of `sequence`: each key-value pair is a column


**Summary**:
+ list of objects: each element is a record (row)
+ dictionary of objects: each key-value pair is a column:
    + key: column name
    + value: the values of that column
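
The row-oriented and column-oriented inputs summarized above can describe the same table. A small sketch with illustrative names and values builds it both ways and compares the results.

```python
import pandas as pd

# Row-oriented input: each dict is one record.
records = [{'Name': 'Alex', 'Age': 10}, {'Name': 'Bob', 'Age': 12}]
df_rows = pd.DataFrame(records)

# Column-oriented input: each key is a column name, each value a column.
columns = {'Name': ['Alex', 'Bob'], 'Age': [10, 12]}
df_cols = pd.DataFrame(columns)

print(df_rows.equals(df_cols))
```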


#### Create an Empty DataFrame
The simplest DataFrame that can be created is an empty one.

In [14]:
df = pd.DataFrame()
print(df)

Empty DataFrame
Columns: []
Index: []


#### Create a DataFrame from Lists

In [15]:
data = [1,2,3,4,5]
df = pd.DataFrame(data)
print(df)

   0
0  1
1  2
2  3
3  4
4  5


In [16]:
# List of instances

data = [['Alex',10],['Bob',12],['Clarke',13]]
df = pd.DataFrame(data,columns=['Name','Age'])
print(df)

     Name  Age
0    Alex   10
1     Bob   12
2  Clarke   13


In [17]:
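data = [['Alex',10],['Bob',12],['Clarke',13]]
# Cast only the Age column: recent pandas versions raise an error when
# dtype=float is applied to the whole frame, because 'Name' holds strings.
df = pd.DataFrame(data,columns=['Name','Age']).astype({'Age': float})
print(df)

# Observe that the Age column is now floating point.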

     Name   Age
0    Alex  10.0
1     Bob  12.0
2  Clarke  13.0


#### Create a DataFrame from Dict of ndarrays / Lists
All the ndarrays must be of same length.
+ If an index is passed, its length must equal the length of the arrays.
+ If no index is passed, then by default, index will be range(n), where n is the array length.

In [18]:
# Each dictionary key becomes a column
data = {'Name':['Tom', 'Jack', 'Steve', 'Ricky'],'Age':[28,34,29,42]}
df = pd.DataFrame(data)
print(df)

    Name  Age
0    Tom   28
1   Jack   34
2  Steve   29
3  Ricky   42


In [19]:
data = {'Name':['Tom', 'Jack', 'Steve', 'Ricky'],'Age':[28,34,29,42]}
df = pd.DataFrame(data, index=['rank1','rank2','rank3','rank4'])
print(df)

# Note: the index parameter assigns a label to each row.

        Name  Age
rank1    Tom   28
rank2   Jack   34
rank3  Steve   29
rank4  Ricky   42


#### Create a DataFrame from List of Dicts
List of Dictionaries can be passed as input data to create a DataFrame. The dictionary keys are by default taken as column names.

In [20]:
data = [{'a': 1, 'b': 2},{'a': 5, 'b': 10, 'c': 20}]
df = pd.DataFrame(data)
print(df)

   a   b     c
0  1   2   NaN
1  5  10  20.0


In [21]:
data = [{'a': 1, 'b': 2},{'a': 5, 'b': 10, 'c': 20}]
df = pd.DataFrame(data, index=['first', 'second'])
print(df)

        a   b     c
first   1   2   NaN
second  5  10  20.0


In [22]:
data = [{'a': 1, 'b': 2},{'a': 5, 'b': 10, 'c': 20}]

# Two columns whose names match dictionary keys
df1 = pd.DataFrame(data, index=['first', 'second'], columns=['a', 'b'])

# Two columns, one of which ('b1') matches no dictionary key and is filled with NaN
df2 = pd.DataFrame(data, index=['first', 'second'], columns=['a', 'b1'])
print(df1)
print()
print(df2)

        a   b
first   1   2
second  5  10

        a  b1
first   1 NaN
second  5 NaN


#### Create a DataFrame from Dict of Series

In [23]:
d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
   'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}

df = pd.DataFrame(d)
print (df)

# Note: series 'one' has no label 'd',
# so the result holds NaN for column 'one' in row 'd'.
# The following sections cover column selection, addition, and deletion.

   one  two
a  1.0    1
b  2.0    2
c  3.0    3
d  NaN    4


### Data Access

+ Selection
    + *Row*: `df.loc['row-name']` or `df.iloc[row-index]` (both also accept a list of labels or positions)
    + *Column*: `df['column-name']` for one column, or `df[['col-1', 'col-2']]` for several
+ Addition
    + *Row*: `pd.concat([df, new_df])`, where `new_df` has the **same number of columns** and the **same column names** (`df.append` was removed in pandas 2.0)
    + *Column*: set `df['new-column-name'] = ...`
+ Deletion
    + *Row*: `df = df.drop(row-label)`
    + *Column*: `del df['column-name']` or `df.pop('column-name')`
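
The selection forms listed above also work with several labels or positions at once. The sketch below uses a made-up frame and passes lists to `loc`, `iloc`, and the column brackets.

```python
import pandas as pd

df = pd.DataFrame(
    {'one': [1, 2, 3, 4], 'two': [10, 20, 30, 40], 'three': [100, 200, 300, 400]},
    index=['a', 'b', 'c', 'd'],
)

rows_by_label = df.loc[['a', 'c']]        # several rows by label
rows_by_position = df.iloc[[0, 2]]        # the same rows by integer position
two_columns = df[['one', 'three']]        # a list of names selects several columns

print(rows_by_label)
print(two_columns)
```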


#### Column Selection

In [24]:
d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
   'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}

df = pd.DataFrame(d)

print(df)

print(df['one'])



   one  two
a  1.0    1
b  2.0    2
c  3.0    3
d  NaN    4
a    1.0
b    2.0
c    3.0
d    NaN
Name: one, dtype: float64


#### Column Addition

In [25]:
d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
   'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}

df = pd.DataFrame(d)

# Adding a new column to an existing DataFrame object with column label by passing new series

print ("Adding a new column by passing as Series:")
df['three']=pd.Series([10,20,30],index=['a','b','c'])
print(df)

print()

print ("Adding a new column using the existing columns in DataFrame:")
df['four']=df['one']+df['three']

print(df)

Adding a new column by passing as Series:
   one  two  three
a  1.0    1   10.0
b  2.0    2   20.0
c  3.0    3   30.0
d  NaN    4    NaN

Adding a new column using the existing columns in DataFrame:
   one  two  three  four
a  1.0    1   10.0  11.0
b  2.0    2   20.0  22.0
c  3.0    3   30.0  33.0
d  NaN    4    NaN   NaN


#### Column Deletion

In [26]:
d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']), 
   'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd']), 
   'three' : pd.Series([10,20,30], index=['a','b','c'])}

df = pd.DataFrame(d)
print ("Our dataframe is:")
print(df)

# using the del statement
print ("Deleting the first column using DEL function:")
del df['one']
print(df)

# using the pop method
print ("Deleting another column using POP function:")
df.pop('two')
print(df)

Our dataframe is:
   one  two  three
a  1.0    1   10.0
b  2.0    2   20.0
c  3.0    3   30.0
d  NaN    4    NaN
Deleting the first column using DEL function:
   two  three
a    1   10.0
b    2   20.0
c    3   30.0
d    4    NaN
Deleting another column using POP function:
   three
a   10.0
b   20.0
c   30.0
d    NaN
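
Both `del` and `pop` modify the DataFrame in place. When the original should stay untouched, `DataFrame.drop(columns=...)` returns a trimmed copy instead. The following is a sketch with illustrative data, contrasting it with `pop`.

```python
import pandas as pd

df = pd.DataFrame({'one': [1, 2, 3], 'two': [4, 5, 6], 'three': [7, 8, 9]})

# drop returns a new DataFrame; the original keeps all three columns.
trimmed = df.drop(columns=['one', 'two'])

# pop removes the column in place and hands it back as a Series.
popped = df.pop('three')

print(trimmed.columns.tolist())
print(df.columns.tolist())
print(popped.tolist())
```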


#### Row Selection, Addition, and Deletion

##### Selection by Label

In [27]:
d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']), 
   'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}

df = pd.DataFrame(d)
print(df,'\n')
print(df.loc['b'])

   one  two
a  1.0    1
b  2.0    2
c  3.0    3
d  NaN    4 

one    2.0
two    2.0
Name: b, dtype: float64


##### Selection by integer location

In [28]:
d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
   'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}

df = pd.DataFrame(d)
print(df,'\n')
print(df.iloc[2])

   one  two
a  1.0    1
b  2.0    2
c  3.0    3
d  NaN    4 

one    3.0
two    3.0
Name: c, dtype: float64


##### Slice Rows

In [29]:
d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']), 
     'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}

df = pd.DataFrame(d)
print(df,'\n')
print(df[2:4])

   one  two
a  1.0    1
b  2.0    2
c  3.0    3
d  NaN    4 

   one  two
c  3.0    3
d  NaN    4
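
Slicing by position and slicing by label treat the end point differently, which is easy to overlook. In this sketch (frame contents are illustrative), `iloc` excludes the stop position while `loc` includes the stop label, so both calls return rows `b` and `c`.

```python
import pandas as pd

df = pd.DataFrame({'one': [1, 2, 3, 4], 'two': [5, 6, 7, 8]},
                  index=['a', 'b', 'c', 'd'])

by_position = df.iloc[1:3]     # positions 1 and 2; the stop position is excluded
by_label = df.loc['b':'c']     # labels 'b' through 'c'; the stop label is included

print(by_position)
print(by_label)
```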


##### Addition of Rows

In [30]:
df = pd.DataFrame([[1, 2], [3, 4]], columns = ['a','b'])
df2 = pd.DataFrame([[5, 6], [7, 8]], columns = ['a','b'])

df = pd.concat([df, df2])  # DataFrame.append was removed in pandas 2.0
print(df)

   a  b
0  1  2
1  3  4
0  5  6
1  7  8
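
Note that the combined frame above keeps the original row labels, so `0` and `1` each appear twice. If unique labels are wanted, one option is `pd.concat` with `ignore_index=True`, sketched here on the same data.

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], columns=['a', 'b'])
df2 = pd.DataFrame([[5, 6], [7, 8]], columns=['a', 'b'])

# ignore_index=True discards the original labels and renumbers the rows 0..n-1
combined = pd.concat([df, df2], ignore_index=True)
print(combined)
```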


##### Deletion of Rows

In [31]:
df = pd.DataFrame([[1, 2], [3, 4]], columns = ['a','b'])
df2 = pd.DataFrame([[5, 6], [7, 8]], columns = ['a','b'])

df = pd.concat([df, df2])  # DataFrame.append was removed in pandas 2.0

# Drop rows with label 0
df = df.drop(0)

print(df)

   a  b
1  3  4
1  7  8
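
In the output above, two rows were dropped because both carried the label `0`. A sketch of one way to remove a single row instead, assuming the duplicate labels are not needed: reset the index first so every row has a unique label.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 3, 5, 7], 'b': [2, 4, 6, 8]}, index=[0, 1, 0, 1])

# drop(0) removes every row labelled 0, so two rows disappear here.
without_label_0 = df.drop(0)

# To remove one physical row, make the labels unique first.
unique = df.reset_index(drop=True)
without_third_row = unique.drop(2)

print(without_label_0)
print(without_third_row)
```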


# Descriptive Statistics for DataFrame

|Function|Description|
|-------|--------|
|`count()`|Number of non-null observations|
|`sum()`|Sum of values|
|`mean()`|Mean of values|
|`median()`|Median of values|
|`mode()`|Mode of values|
|`std()`|Standard deviation of the values|
|`min()`|Minimum value|
|`max()`|Maximum value|
|`abs()`|Absolute value|
|`prod()`|Product of values|
|`cumsum()`|Cumulative sum|
|`cumprod()`|Cumulative product|
|`describe()`|Computes summary statistics for the DataFrame columns|
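
Several of the functions in the table can be tried on a small numeric frame. This is a minimal sketch; the four-row data set is made up so the results are easy to verify by hand.

```python
import pandas as pd

df = pd.DataFrame({'Age': [25, 26, 25, 23], 'Rating': [4.0, 3.0, 4.0, 2.0]})

print(df.count())            # non-null observations per column
print(df.sum())              # column totals
print(df.mean())             # column means
print(df.median())           # column medians
print(df['Age'].mode())      # most frequent value(s); may return several
print(df['Age'].cumsum())    # running total
print(df['Age'].max() - df['Age'].min())  # range of the column
```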

*Note*: `df.describe()` reports the count, mean, standard deviation, minimum, quartiles, and maximum. By default it summarizes only the **numeric** columns and skips string columns.
+ `include` controls which columns are summarized. It takes a list of dtypes or the string `'all'`; by default (`None`), only numeric columns are included.
    + `object`: summarizes string columns
    + `number`: summarizes numeric columns
    + `all`: summarizes all columns together (pass it as the string `'all'`, not as a list)






In [32]:
#Create a Dictionary of series
d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Smith','Jack',
   'Lee','David','Gasper','Betina','Andres']),
   'Age':pd.Series([25,26,25,23,30,29,23,34,40,30,51,46]),
   'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8,3.78,2.98,4.80,4.10,3.65])
}

#Create a DataFrame
df = pd.DataFrame(d)
print(df.describe())

             Age     Rating
count  12.000000  12.000000
mean   31.833333   3.743333
std     9.232682   0.661628
min    23.000000   2.560000
25%    25.000000   3.230000
50%    29.500000   3.790000
75%    35.500000   4.132500
max    51.000000   4.800000


In [33]:
#Create a Dictionary of series
d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Smith','Jack',
   'Lee','David','Gasper','Betina','Andres']),
   'Age':pd.Series([25,26,25,23,30,29,23,34,40,30,51,46]),
   'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8,3.78,2.98,4.80,4.10,3.65])
}

#Create a DataFrame
df = pd.DataFrame(d)
print( df.describe(include=['object']))

        Name
count     12
unique    12
top     Jack
freq       1


In [34]:
#Create a Dictionary of series
d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Smith','Jack',
   'Lee','David','Gasper','Betina','Andres']),
   'Age':pd.Series([25,26,25,23,30,29,23,34,40,30,51,46]),
   'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8,3.78,2.98,4.80,4.10,3.65])
}

#Create a DataFrame
df = pd.DataFrame(d)
print(df.describe(include='all'))

        Name        Age     Rating
count     12  12.000000  12.000000
unique    12        NaN        NaN
top     Jack        NaN        NaN
freq       1        NaN        NaN
mean     NaN  31.833333   3.743333
std      NaN   9.232682   0.661628
min      NaN  23.000000   2.560000
25%      NaN  25.000000   3.230000
50%      NaN  29.500000   3.790000
75%      NaN  35.500000   4.132500
max      NaN  51.000000   4.800000
