# Pandas Data Structures

The core data structures in pandas are ```Series``` and ```DataFrame```, both of which are built on top of NumPy. So before starting, we need to import the ```NumPy``` and ```Pandas``` libraries.

In [2]:
# import the numpy and pandas libraries, aliased as np and pd respectively

import numpy as np
import pandas as pd

## Series

A series is a one-dimensional labeled object that can be created from various inputs such as an ```Array```, a ```Dict```, or a ```Scalar value or constant```. By default, each value in a series receives an index from 0 to N-1, where N is the length of the data.

In [2]:
# example of creating a simple series

Ser = pd.Series ([3.14, "python", -10, 'BC34'])
print (Ser)

0      3.14
1    python
2       -10
3      BC34
dtype: object
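
The text above mentions arrays as one possible input. As a minimal sketch, assuming a small NumPy array with made-up values (the names `arr` and `ser_from_array` are only for illustration), the default index runs from 0 to N-1:

```python
import numpy as np
import pandas as pd

# Hypothetical sample values, chosen only for illustration
arr = np.array([10.5, 20.0, 30.25])

# No index is given, so pandas assigns the default labels 0 .. N-1
ser_from_array = pd.Series(arr)

print(ser_from_array)
print(list(ser_from_array.index))
```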


You can assign your own index label to each value in the series, as shown below:

In [3]:
Ser = pd.Series ([3.14, "python", -10, 'BC34'], 
                 index = ['A', 'B', 'C', 'D'])
print (Ser)

A      3.14
B    python
C       -10
D      BC34
dtype: object


In [4]:
Ser.values

array([3.14, 'python', -10, 'BC34'], dtype=object)

Values in a series can be selected by passing a list of index labels:

In [5]:
Ser[['C','B',]]   # Using index for calling values in a series.

C       -10
B    python
dtype: object

### Creating a series from a dictionary

In [6]:
Data = {'Name': ['Bob', 'John', 'Mary'], 'Age': [15, 23, 17], 'Color': ['white', 'black', 'black']}

Sdata = pd.Series(Data)
print (Sdata)

Age               [15, 23, 17]
Color    [white, black, black]
Name         [Bob, John, Mary]
dtype: object


In this example, the dictionary's keys become the index of the series. You can also pass a list of keys as the index to select and order the values:

In [7]:
Features = ['Name', 'Age', 'Color', 'Weigth']

Sdata = pd.Series (Data, index = Features)
print (Sdata)

Name          [Bob, John, Mary]
Age                [15, 23, 17]
Color     [white, black, black]
Weigth                      NaN
dtype: object


Note: Since the 'Data' dictionary has no value for ```Weigth```, it appears as NaN. Such entries are considered 'missing data' or 'NA values'.

When working with large datasets, detecting missing data is essential. The ```isnull``` and ```notnull``` functions can be used for this purpose.

In [8]:
pd.isnull(Sdata)

Name      False
Age       False
Color     False
Weigth     True
dtype: bool

In [9]:
pd.notnull(Sdata)

Name       True
Age        True
Color      True
Weigth    False
dtype: bool
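
The boolean results above can also be used to count or filter missing entries. The following is a small sketch on a made-up series (the name `s` and its values are assumptions for illustration); it relies on the method forms `isnull()` and `notnull()`, which give the same result as the `pd.isnull` and `pd.notnull` functions:

```python
import numpy as np
import pandas as pd

# Hypothetical series with one missing entry
s = pd.Series([1.5, np.nan, 3.0], index=['x', 'y', 'z'])

# Boolean mask marking the missing entries (same as pd.isnull(s))
missing_mask = s.isnull()

# True counts as 1, so the sum is the number of missing values
n_missing = int(missing_mask.sum())

# Keep only the entries that are present
present = s[s.notnull()]

print(n_missing)
print(present)
```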

In [10]:
# Retrieve some elements from a series

Ser = pd.Series ([1,2,3,4,5,6,7], index = ['a','b','c','d','e','f','g'])

print (Ser[1:5])

b    2
c    3
d    4
e    5
dtype: int64


In [11]:
print (Ser[-3:])

e    5
f    6
g    7
dtype: int64


In [12]:
#Retrieve data using index

print (Ser [['a','d','f','g']])

a    1
d    4
f    6
g    7
dtype: int64
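
Slices can be written more explicitly with the ```.iloc``` (position-based) and ```.loc``` (label-based) indexers. A short sketch on the same kind of series (the lowercase name `ser` is used here only for illustration) shows one difference worth remembering: a position slice excludes its end point, while a label slice includes it.

```python
import pandas as pd

ser = pd.Series([1, 2, 3, 4, 5, 6, 7],
                index=['a', 'b', 'c', 'd', 'e', 'f', 'g'])

# Positions 1, 2, 3, 4: the end position 5 is excluded
by_position = ser.iloc[1:5]

# Labels 'b' through 'e': the end label 'e' is included
by_label = ser.loc['b':'e']

print(by_position)
print(by_label)
print(by_position.equals(by_label))
```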


### Creating a series from a scalar

If the data is a scalar value, the value is repeated to match the length of the index. The important point is that an index must be provided so that pandas knows how many times to repeat the value.

In [14]:
Ser = pd.Series (23 , index = [0,1,2,3,4,5])
print (Ser)

0    23
1    23
2    23
3    23
4    23
5    23
dtype: int64


## Basic Functionality in Series

```axes``` Returns a list of the row axis labels.

```dtype``` Returns the dtype of the object.

```empty``` Returns True if series is empty.

```ndim```  Returns the number of dimensions of the underlying data.

```size``` Returns the number of elements in the underlying data.

```values``` Returns the Series as ndarray.

```head()``` Returns the first n rows.

```tail()``` Returns the last n rows.



(Reference :www.tutotialspoint.com/python_pandas)

#### These attributes and methods are accessed on a series as follows:

#### NameSeries.```function```

In [15]:
# Some examples of using these attributes and methods on a Series:


Ser = pd.Series ([1,2,3,4,5,6,7], index = ['a','b','c','d','e','f','g'])

print ("The axes are: ")
print (Ser.axes)

print ("The number of dimensions of the object is: ")
print (Ser.ndim)

print ("The size of the object is: ")
print (Ser.size)

print ("The data in the Series is: ")
print (Ser.values)

print ("The first 4 rows of the data series: ")
print (Ser.head(4))


The axes are: 
[Index(['a', 'b', 'c', 'd', 'e', 'f', 'g'], dtype='object')]
The number of dimensions of the object is: 
1
The size of the object is: 
7
The data in the Series is: 
[1 2 3 4 5 6 7]
The first 4 rows of the data series: 
a    1
b    2
c    3
d    4
dtype: int64
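
The example above does not exercise ```dtype```, ```empty```, or ```tail()```. A brief sketch of those three follows, assuming a second, deliberately empty series (`empty_ser` is a hypothetical name) for comparison:

```python
import pandas as pd

ser = pd.Series([1, 2, 3, 4, 5, 6, 7],
                index=['a', 'b', 'c', 'd', 'e', 'f', 'g'])

# An empty series; the dtype is given explicitly because there is no data to infer it from
empty_ser = pd.Series([], dtype='float64')

print(ser.dtype)        # data type of the values
print(ser.empty)        # False, the series has elements
print(empty_ser.empty)  # True, the series has no elements
print(ser.tail(2))      # last 2 rows
```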


## DataFrame

A ```DataFrame``` is a two-dimensional data structure with both a row index and a column index, so its structure resembles a table.

A ```DataFrame``` can be created from various inputs, such as a ```List```, ```Dictionary```, ```Series```, or ```NumPy ndarray```.

### Creating a DataFrame from lists

In [16]:
Data = [100, 120, 130, 140, 150]

df = pd.DataFrame(Data)
print (df)

     0
0  100
1  120
2  130
3  140
4  150


In [17]:
raw_data = [['Jason','Miller',42,4,25],['Molly','Jacobson',52,24,94],['Tina','Alison',36,31,57],['Jake','Milner',24,2,62],
            ['Amy','Cooze',73,3,70]]

df = pd.DataFrame (raw_data, columns = ['first_name', 'last_name','age','preTestScore','postTestScore'])
print (df)


  first_name last_name  age  preTestScore  postTestScore
0      Jason    Miller   42             4             25
1      Molly  Jacobson   52            24             94
2       Tina    Alison   36            31             57
3       Jake    Milner   24             2             62
4        Amy     Cooze   73             3             70


### Creating a DataFrame from a dictionary

In [18]:
raw_data = {'first_name': ['Jason','Molly','Tina','Jake','Amy'], 'last_name': ['Miller','Jacobson','Alison','Milner','Cooze'], 
           'age': [42,52,36,24,73], 'preTestScore': [4,24,31,2,3], 'postTestScore': [25,94,57,62,70]}

df = pd.DataFrame (raw_data , index = ['rank1','rank2','rank3','rank4','rank5'])
print (df)

       age first_name last_name  postTestScore  preTestScore
rank1   42      Jason    Miller             25             4
rank2   52      Molly  Jacobson             94            24
rank3   36       Tina    Alison             57            31
rank4   24       Jake    Milner             62             2
rank5   73        Amy     Cooze             70             3


In [19]:
# Create a DataFrame from list of dicts

Data = [{'first_attempt':12, 'second_attempt':10.78,}, {'first_attempt':14.1, 'second_attempt':13.2, 'third_attempt':12}]

df = pd.DataFrame (Data)
print (df)

   first_attempt  second_attempt  third_attempt
0           12.0           10.78            NaN
1           14.1           13.20           12.0


In [20]:
# define index 

df = pd.DataFrame (Data, index = ['score1','score2'])
print (df)

        first_attempt  second_attempt  third_attempt
score1           12.0           10.78            NaN
score2           14.1           13.20           12.0


### Creating a DataFrame from Dict of Series

In [21]:

Data = {'first' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
      'second' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}

df = pd.DataFrame(Data)
print (df)

   first  second
a    1.0       1
b    2.0       2
c    3.0       3
d    NaN       4
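
NumPy ndarrays were listed earlier as a possible input for a ```DataFrame``` as well. As a minimal sketch, assuming a small 3x2 array and made-up row and column labels (`arr`, `df_from_array`, 'x', 'y', 'r1' to 'r3' are all hypothetical names):

```python
import numpy as np
import pandas as pd

# Hypothetical 3x2 array holding the values 0..5
arr = np.arange(6).reshape(3, 2)

# Each array row becomes a DataFrame row; labels are supplied explicitly
df_from_array = pd.DataFrame(arr,
                             columns=['x', 'y'],
                             index=['r1', 'r2', 'r3'])

print(df_from_array)
```

If ```columns``` and ```index``` are omitted, both default to integer labels starting at 0, as in the list example earlier.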


### Column Addition

In [22]:
raw_data = {'first_name': ['Jason','Molly','Tina','Jake','Amy'], 'last_name': ['Miller','Jacobson','Alison','Milner','Cooze'], 
           'age': [42,52,36,24,73], 'preTestScore': [4,24,31,2,3], 'postTestScore': [25,94,57,62,70]}

df = pd.DataFrame (raw_data , index = ['rank1','rank2','rank3','rank4','rank5'])

print ("Original data: ")

print (df)

# adding a new column to the existing columns of a DataFrame object

date = [2017, 2018,2017,np.nan,2015]

df["date"] = date

    
print ("New DataFrame after inserting the 'date' column")

print (df)

Original data: 
       age first_name last_name  postTestScore  preTestScore
rank1   42      Jason    Miller             25             4
rank2   52      Molly  Jacobson             94            24
rank3   36       Tina    Alison             57            31
rank4   24       Jake    Milner             62             2
rank5   73        Amy     Cooze             70             3
New DataFrame after inserting the 'date' column
       age first_name last_name  postTestScore  preTestScore    date
rank1   42      Jason    Miller             25             4  2017.0
rank2   52      Molly  Jacobson             94            24  2018.0
rank3   36       Tina    Alison             57            31  2017.0
rank4   24       Jake    Milner             62             2     NaN
rank5   73        Amy     Cooze             70             3  2015.0


In [23]:
Data = {'first' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
      'second' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}

df = pd.DataFrame(Data)
print (df)

# adding a new column to the existing columns of a DataFrame object

df ['third'] = pd.Series([100,200,300,400], index = ['a','b','c','d'])

print ("New DataFrame after inserting the 'third' column")

print (df)

   first  second
a    1.0       1
b    2.0       2
c    3.0       3
d    NaN       4
New DataFrame after inserting the 'third' column
   first  second  third
a    1.0       1    100
b    2.0       2    200
c    3.0       3    300
d    NaN       4    400
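
A new column can also be computed from existing columns. The sketch below is an illustrative variation on the example above (the names `df_sum` and 'total' are assumptions); values are aligned by index label, so a row with a missing input yields NaN in the result:

```python
import pandas as pd

df_sum = pd.DataFrame({
    'first': pd.Series([1, 2, 3], index=['a', 'b', 'c']),
    'second': pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd']),
})

# Element-wise sum of two columns; row 'd' has no 'first' value, so its total is NaN
df_sum['total'] = df_sum['first'] + df_sum['second']

print(df_sum)
```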


### Column Deletion

In [24]:
raw_data = {'first_name': ['Jason','Molly','Tina','Jake','Amy'], 'last_name': ['Miller','Jacobson','Alison','Milner','Cooze'], 
           'age': [42,52,36,24,73], 'preTestScore': [4,24,31,2,3], 'postTestScore': [25,94,57,62,70]}

df = pd.DataFrame(raw_data)

# dropping a column from a DataFrame object by using the drop function

df.drop('preTestScore', axis = 1)     # the argument axis=1 denotes columns



   age first_name last_name  postTestScore
0   42      Jason    Miller             25
1   52      Molly  Jacobson             94
2   36       Tina    Alison             57
3   24       Jake    Milner             62
4   73        Amy     Cooze             70
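
One point that is easy to miss: by default, ```drop``` returns a new DataFrame and leaves the original unchanged, so the result has to be assigned to a name to keep it. A small sketch with a cut-down table (the names `scores`, `trimmed`, and `trimmed_kw` are hypothetical):

```python
import pandas as pd

scores = pd.DataFrame({'age': [42, 52],
                       'preTestScore': [4, 24],
                       'postTestScore': [25, 94]})

# drop returns a new DataFrame; scores itself still has the column
trimmed = scores.drop('preTestScore', axis=1)

print('preTestScore' in scores.columns)   # the original is unchanged
print('preTestScore' in trimmed.columns)  # the returned copy lacks the column

# Equivalent spelling using the columns keyword instead of axis=1
trimmed_kw = scores.drop(columns=['preTestScore'])
```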


In [25]:
# This example shows that we can use the del statement to drop a column from a DataFrame

Data = {'first' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
      'second' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}

df = pd.DataFrame(Data)

# using the del statement

del df['first']
print (df)

   second
a       1
b       2
c       3
d       4


## Basic Functionality in DataFrame

```T``` Transposes rows and columns.

```axes``` Returns a list with the row axis labels and the column axis labels.

```dtypes``` Returns the dtype of each column.

```empty``` Returns True if the DataFrame is empty.

```ndim```  Returns the number of axes / array dimensions.

```shape``` Returns a tuple with the number of rows and columns.

```size``` Returns the number of elements in the underlying data.

```values``` Returns the data of the DataFrame as a NumPy ndarray.

```head()``` Returns the first n rows.

```tail()``` Returns the last n rows.



(Reference :www.tutotialspoint.com/python_pandas)

In [26]:
raw_data = {'first_name': ['Jason','Molly','Tina','Jake','Amy'], 'last_name': ['Miller','Jacobson','Alison','Milner','Cooze'], 
           'age': [42,52,36,24,73], 'preTestScore': [4,24,31,2,3], 'postTestScore': [25,94,57,62,70]}


# create a Dictionary of series

raw_data = {'first_name': pd.Series (['Jason','Molly','Tina','Jake','Amy']), 'last_name': pd.Series (['Miller','Jacobson','Alison','Milner','Cooze']),
           'age': pd.Series ([42,52,36,24,73]), 'preTestScore': pd.Series ([4,24,31,2,3]), 'postTestScore': pd.Series ([25,94,57,62,70])}

df = pd.DataFrame (raw_data)

print (df)

# Transpose

print ("The transpose of the data series is: ")
print (df.T)

# dtypes

print ("The data types of each column are: ")
print (df.dtypes)


# ndim

print ("The dimension is: ")
print (df.ndim)

# shape

print ("The shape is: ")
print (df.shape)

# size

print ("The total number of elements is: ")
print (df.size)

   age first_name last_name  postTestScore  preTestScore
0   42      Jason    Miller             25             4
1   52      Molly  Jacobson             94            24
2   36       Tina    Alison             57            31
3   24       Jake    Milner             62             2
4   73        Amy     Cooze             70             3
The transpose of the data series is: 
                    0         1       2       3      4
age                42        52      36      24     73
first_name      Jason     Molly    Tina    Jake    Amy
last_name      Miller  Jacobson  Alison  Milner  Cooze
postTestScore      25        94      57      62     70
preTestScore        4        24      31       2      3
The data types of each column are: 
age               int64
first_name       object
last_name        object
postTestScore     int64
preTestScore      int64
dtype: object
The dimension is: 
2
The shape is: 
(5, 5)
The total number of elements is: 
25
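
The remaining items from the list (```axes```, ```empty```, ```values```, ```head()```, ```tail()```) work the same way. A short sketch on a smaller, made-up table (`people` is a hypothetical name):

```python
import pandas as pd

people = pd.DataFrame({'first_name': ['Jason', 'Molly', 'Tina'],
                       'age': [42, 52, 36]})

# axes holds two entries for a DataFrame: row labels, then column labels
row_labels, column_labels = people.axes
print(list(row_labels))
print(list(column_labels))

print(people.empty)         # False, the table has data
print(people.values.shape)  # the underlying ndarray is 3 rows by 2 columns

print(people.head(2))       # first 2 rows
print(people.tail(1))       # last row
```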


## More functions in ```Series``` and ```DataFrame```

## Reindexing

The ```reindex``` function changes the order of the rows and columns in a ```Series``` or a ```DataFrame```.

In [29]:
# Create DataFrame

raw_data = {'first_name': ['Jason','Molly','Tina','Jake','Amy'], 'last_name': ['Miller','Jacobson','Alison','Milner','Cooze'], 
           'age': [42,52,36,24,73], 'preTestScore': [4,24,31,2,3], 'postTestScore': [25,94,57,62,70]}

df = pd.DataFrame(raw_data)
print (df)

   age first_name last_name  postTestScore  preTestScore
0   42      Jason    Miller             25             4
1   52      Molly  Jacobson             94            24
2   36       Tina    Alison             57            31
3   24       Jake    Milner             62             2
4   73        Amy     Cooze             70             3


In [30]:
# reindex or change the order of rows

df.reindex ([3,1,4,0,2])

   age first_name last_name  postTestScore  preTestScore
3   24       Jake    Milner             62             2
1   52      Molly  Jacobson             94            24
4   73        Amy     Cooze             70             3
0   42      Jason    Miller             25             4
2   36       Tina    Alison             57            31


Note: If we reindex a ```Series``` or ```DataFrame``` with a list containing a label that is not in the original index, the new row is filled with null values (NaN).

In [33]:
# reindex or change the order of rows with new inputs

df.reindex ([3,1,7,4,0,2,6])

    age first_name last_name  postTestScore  preTestScore
3  24.0       Jake    Milner           62.0           2.0
1  52.0      Molly  Jacobson           94.0          24.0
7   NaN        NaN       NaN            NaN           NaN
4  73.0        Amy     Cooze           70.0           3.0
0  42.0      Jason    Miller           25.0           4.0
2  36.0       Tina    Alison           57.0          31.0
6   NaN        NaN       NaN            NaN           NaN
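
If NaN is not the desired placeholder, ```reindex``` accepts a ```fill_value``` argument that is used for labels missing from the original index. A minimal sketch on a small numeric table (the names `scores` and `filled`, and the choice of 0 as the fill value, are assumptions for illustration):

```python
import pandas as pd

scores = pd.DataFrame({'student1': [100, 93],
                       'student2': [93, 96]},
                      index=['score1', 'score2'])

# 'score5' is not in the original index, so its row is filled with 0 instead of NaN
filled = scores.reindex(['score2', 'score1', 'score5'], fill_value=0)

print(filled)
```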


In [32]:
# reindex or change the order of columns

columnsTitles = ['first_name','last_name','age','preTestScore', 'postTestScore']

df.reindex (columns = columnsTitles)

  first_name last_name  age  preTestScore  postTestScore
0      Jason    Miller   42             4             25
1      Molly  Jacobson   52            24             94
2       Tina    Alison   36            31             57
3       Jake    Milner   24             2             62
4        Amy     Cooze   73             3             70


##### Practice: Pass a new list of column titles to ```reindex``` and see what happens when the list includes a name that is not in the DataFrame.

In [38]:
# One more example:

Score = {'student1' : pd.Series([100, 93,87,100], index=['score1', 'score2', 'score3', 'score4']),
      'student2' : pd.Series([93,96,79,98], index=['score1', 'score2', 'score3', 'score4']),
         'student3' : pd.Series([100,99,96,89], index=['score1', 'score2', 'score3', 'score4'])}

df = pd.DataFrame(Score)
print (df)

        student1  student2  student3
score1       100        93       100
score2        93        96        99
score3        87        79        96
score4       100        98        89


In [42]:
df.reindex (['score2', 'score4', 'score1', 'score5'])


        student1  student2  student3
score2      93.0      96.0      99.0
score4     100.0      98.0      89.0
score1     100.0      93.0     100.0
score5       NaN       NaN       NaN


## How to select multiple rows and columns from a ```DataFrame```

#### The position-based ```.iloc``` and label-based ```.loc``` indexers let you select multiple rows and columns from a ```DataFrame```.

### ```.iloc``` indexer

In [6]:
raw_data = {'first_name': ['Jason','Molly','Tina','Jake','Amy'], 'last_name': ['Miller','Jacobson','Alison','Milner','Cooze'], 
           'age': [42,52,36,24,73], 'preTestScore': [4,24,31,2,3], 'postTestScore': [25,94,57,62,70]}

df = pd.DataFrame(raw_data)
print (df)

   age first_name last_name  postTestScore  preTestScore
0   42      Jason    Miller             25             4
1   52      Molly  Jacobson             94            24
2   36       Tina    Alison             57            31
3   24       Jake    Milner             62             2
4   73        Amy     Cooze             70             3


In [11]:
# Passing a single position returns that row as a Series
df.iloc[3]

age                  24
first_name         Jake
last_name        Milner
postTestScore        62
preTestScore          2
Name: 3, dtype: object

To get the result as a DataFrame, pass the position inside a list:

In [13]:
df.iloc[[3]]

   age first_name last_name  postTestScore  preTestScore
3   24       Jake    Milner             62             2


In [14]:
df.iloc[[-1]]

   age first_name last_name  postTestScore  preTestScore
4   73        Amy     Cooze             70             3


In [18]:
#Selecting more than one row using .iloc 
df.iloc[[0,2]]

   age first_name last_name  postTestScore  preTestScore
0   42      Jason    Miller             25             4
2   36       Tina    Alison             57            31


In [37]:
# everything to the left of the comma selects rows; everything to the right selects columns

df.iloc[[0,2],[1]]

  first_name
0      Jason
2       Tina


In [20]:
df.iloc[0:3,1:3]

  first_name last_name
0      Jason    Miller
1      Molly  Jacobson
2       Tina    Alison


### ```.loc``` indexer

The ```.loc``` indexer selects rows and columns by their labels.

In [27]:
#example (introducing a data frame)

Score = {'student1' : pd.Series([100, 93,87,100], index=['score1', 'score2', 'score3', 'score4']),
      'student2' : pd.Series([93,96,79,98], index=['score1', 'score2', 'score3', 'score4']),
         'student3' : pd.Series([100,99,96,89], index=['score1', 'score2', 'score3', 'score4'])}

df = pd.DataFrame(Score)
print (df)

        student1  student2  student3
score1       100        93       100
score2        93        96        99
score3        87        79        96
score4       100        98        89


In [28]:
df.loc['score3']

student1    87
student2    79
student3    96
Name: score3, dtype: int64

To get the result as a DataFrame, pass the label inside a list:

In [29]:
df.loc[['score3']]

        student1  student2  student3
score3        87        79        96


In [31]:
# everything to the left of the comma selects rows; everything to the right selects columns

df.loc[['score2','score3'],['student2']]

        student2
score2        96
score3        79


In [36]:
df.loc['score1':'score2','student2':'student3']

        student2  student3
score1        93       100
score2        96        99
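
Comparing the two indexers side by side: a ```.loc``` slice includes its end label on both axes, while an ```.iloc``` slice excludes its end position. The sketch below uses a two-column version of the score table (`df_scores`, `by_label`, and `by_position` are hypothetical names) and selects the same block both ways:

```python
import pandas as pd

df_scores = pd.DataFrame({'student1': [100, 93, 87, 100],
                          'student2': [93, 96, 79, 98]},
                         index=['score1', 'score2', 'score3', 'score4'])

# Label slices include the end labels 'score2' and 'student2'
by_label = df_scores.loc['score1':'score2', 'student1':'student2']

# Position slices exclude the end position 2 on both axes
by_position = df_scores.iloc[0:2, 0:2]

print(by_label)
print(by_label.equals(by_position))
```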


## Arithmetic Operations

#### ```add()```

#### ```sub()```

#### ```mul()```

#### ```div()```


In [46]:
# example of applying arithmetic operations

df = pd.DataFrame ({'first': pd.Series(np.random.randn(4), index = ['a','b','c','d']), 
                    'second': pd.Series(np.random.randn(4), index = ['a','b','c','d']), 
                    'third': pd.Series(np.random.randn(4), index = ['a','b','c','d'])})

print (df)

      first    second     third
a  0.508703 -0.322998 -0.992140
b  0.515813  0.115337  1.279374
c  0.869872  0.850733  2.083633
d -0.959646  2.869825 -0.932328


In [64]:
row = df.iloc[3]


df.add(row, axis =1)

      first    second     third
a -0.450944  2.546827 -1.924468
b -0.443833  2.985162  0.347046
c -0.089775  3.720559  1.151305
d -1.919293  5.739651 -1.864656


In [65]:
df.sub(row, axis = 1)

      first    second     third
a  1.468349 -3.192824 -0.059812
b  1.475460 -2.754489  2.211702
c  1.829518 -2.019092  3.015961
d  0.000000  0.000000  0.000000


In [66]:
df.mul(row, axis = 1)

      first    second     third
a -0.488175 -0.926949  0.924999
b -0.494998  0.330997 -1.192796
c -0.834769  2.441456 -1.942629
d  0.920921  8.235898  0.869235
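
The fourth operation, ```div()```, follows the same pattern. As a sketch with fixed values instead of random ones so the result is easy to check by hand (`df_num`, `divided`, and `same_as_operator` are hypothetical names), dividing every row by the last row gives:

```python
import pandas as pd

df_num = pd.DataFrame({'first': [2.0, 4.0],
                       'second': [10.0, 20.0]},
                      index=['a', 'b'])

# Take the last row as a Series indexed by column name
row = df_num.iloc[1]

# axis=1 matches the Series labels against the column labels
divided = df_num.div(row, axis=1)

# The / operator aligns a Series on the columns in the same way
same_as_operator = df_num / row

print(divided)
print(divided.equals(same_as_operator))
```

The method form is mainly useful when the alignment axis needs to be stated explicitly; for the column-wise case shown here the plain operator gives the same result.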
