<table class="table table-bordered">
    <tr>
        <th style="text-align:center;"><h3>ANLY104 - Notebook 0</h3><h3>Pandas - Series and DataFrames</h3></th>
    </tr>
</table>

### Learning Outcomes

At the end of this lesson, you should be able to:
<ul>
<li>Import the Pandas library</li>
<li>Create and manipulate Pandas Series (with custom indices)</li>
<li>Generate descriptive statistics for Pandas Series</li>
<li>Create Pandas DataFrames (with custom indices)</li>
<li>Select specific rows and columns of a DataFrame using the loc and iloc indexers</li>
<li>Understand Boolean indexing for DataFrames</li>
<li>Generate descriptive statistics for Pandas DataFrames</li>
</ul>

## <font color = "blue">Part I - Series Data Structure in Pandas</font>

This part follows the video in the Homework list.

### Importing the Pandas Library

In [2]:
# Import the Pandas library.
# The name "Pandas" is derived from "panel data".
# Panel data tracks the same subjects over multiple time periods.
import pandas as pd

### Creating a Pandas Series

In [3]:
# Create a Series.
# A Series is a 1-D labelled data structure. Unlike a Python list, it has a single dtype for all its values.
grades = pd.Series([87, 77, 90])

### Displaying a Pandas Series

In [4]:
# Display the entire Series.
# Indices are on the left, data values on the right.
grades

0    87
1    77
2    90
dtype: int64
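
The `dtype` line in the output describes the whole `Series`, not each element. As a minimal sketch (the variable names `mixed_list`, `int_series` and `mixed_series` are illustrative, not from the lesson), the cell below contrasts a plain Python list, which can mix types freely, with a `Series`, which always reports one dtype:

```python
import pandas as pd

# A plain Python list can hold values of different types side by side.
mixed_list = [87, "seventy-seven", 90.5]

# A Series built from integers gets a single integer dtype.
int_series = pd.Series([87, 77, 90])

# A Series built from mixed values falls back to the generic object dtype.
mixed_series = pd.Series(mixed_list)

print(int_series.dtype)    # an integer dtype (int64 in the output above)
print(mixed_series.dtype)  # object
```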

### Accessing the Pandas Series elements

In [5]:
# Get the data value stored at index 0.
# Remember that Python starts counting from 0.
grades[0]

np.int64(87)

In [6]:
# Get the data value stored at index 2, i.e. the third element.
grades[2]

np.int64(90)

### Generating descriptive statistics for Pandas Series 

In [7]:
# Min value of the Series
grades.min()

np.int64(77)

In [8]:
# Max value of the Series
grades.max()

np.int64(90)

In [9]:
# Number of elements in the Series
grades.count()

np.int64(3)

In [10]:
# Standard deviation
grades.std()

np.float64(6.8068592855540455)
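
The value above is the sample standard deviation: by default `std()` divides by n - 1 (`ddof=1`). A small sketch, assuming you may also want the population version (dividing by n), which is available through the `ddof` parameter:

```python
import pandas as pd

grades = pd.Series([87, 77, 90])

# Default: sample standard deviation (ddof=1, divide by n - 1).
sample_std = grades.std()

# Population standard deviation (ddof=0, divide by n).
population_std = grades.std(ddof=0)

print(sample_std)
print(population_std)  # smaller than the sample value
```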

In [12]:
# Summary statistics
grades.describe()

count     3.000000
mean     84.666667
std       6.806859
min      77.000000
25%      82.000000
50%      87.000000
75%      88.500000
max      90.000000
dtype: float64
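
The result of `describe()` is itself a `Series` whose index labels are the statistic names, so individual statistics can be read by label. A brief sketch (the name `summary` is only illustrative):

```python
import pandas as pd

grades = pd.Series([87, 77, 90])

# describe() returns a Series indexed by statistic name.
summary = grades.describe()

print(summary["mean"])                  # the mean of the three grades
print(summary["50%"])                   # the median
print(summary["max"] - summary["min"])  # the range of the grades
```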

### Creating a Pandas Series with custom indices

In [13]:
# Create a list of strings.
# A list is a built-in 1-D Python structure; here it supplies the index labels for a Series.
names = ['Sam', 'Andy', 'Dan']

In [14]:
# Create a Series with the numbers as the data and the names as the index labels (second argument).
grades = pd.Series([87, 77, 90], names)

In [15]:
# Display the entire Series.
# Indices are on the left, data values on the right.
grades

Sam     87
Andy    77
Dan     90
dtype: int64

### Accessing elements with custom indices

In [16]:
# Get the data stored at 'Sam'
grades['Sam']

np.int64(87)

In [17]:
# Get the data stored at 'Dan'
grades['Dan']

np.int64(90)
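
A Series with custom labels can still be read by position. The sketch below is one way to make the intent explicit: it passes the labels through the `index` keyword and uses `loc` for labels and `iloc` for integer positions (the names `by_label` and `by_position` are illustrative):

```python
import pandas as pd

names = ['Sam', 'Andy', 'Dan']

# Passing the labels with the index keyword is equivalent to the
# positional form pd.Series([87, 77, 90], names).
grades = pd.Series([87, 77, 90], index=names)

# Look up Andy's grade by label and by integer position.
by_label = grades.loc['Andy']
by_position = grades.iloc[1]

print(by_label, by_position)
print(list(grades.index))
```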

### Creating a Pandas Series with strings

In [18]:
# Create a Series storing strings
hardware = pd.Series(['hammer','saw','wrench'])

In [19]:
# Displaying the entire Series.
# Indices are on the left, data values on the right.
hardware

0    hammer
1       saw
2    wrench
dtype: object

Note that if a `Series` contains strings, you can call string methods on it through the `.str` accessor. For example, `str.contains` checks which elements contain a particular substring.

Please check the following link:<br />
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.contains.html

In [20]:
# For each element in the Series, does it contain an 'a'?
hardware.str.contains('a')

0     True
1     True
2    False
dtype: bool
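
The Boolean `Series` returned by `str.contains` can be used to keep only the matching elements. A short sketch, which also uses the `case` parameter of `str.contains` for a case-insensitive match (the variable names are illustrative):

```python
import pandas as pd

hardware = pd.Series(['hammer', 'saw', 'wrench'])

# Boolean mask: True where the element contains an 'a'.
has_a = hardware.str.contains('a')

# Use the mask to keep only the matching elements.
with_a = hardware[has_a]

# case=False ignores upper/lower case when matching.
has_w_any_case = hardware.str.contains('W', case=False)

print(with_a.tolist())
print(has_w_any_case.tolist())
```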

You can also use the `str.upper()` method to get a copy of the `Series` with all elements in uppercase. The original `Series` is not modified.

In [21]:
# Return a new Series with all elements in uppercase
hardware.str.upper()

0    HAMMER
1       SAW
2    WRENCH
dtype: object
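
String methods reached through `.str` return a new `Series` and leave the original untouched. A minimal sketch showing this, with `str.len()` as a second example of the same accessor (the names `upper_hardware` and `lengths` are illustrative):

```python
import pandas as pd

hardware = pd.Series(['hammer', 'saw', 'wrench'])

# str.upper() returns a new Series; hardware itself keeps its values.
upper_hardware = hardware.str.upper()

# str.len() gives the number of characters in each element.
lengths = hardware.str.len()

print(hardware.tolist())
print(upper_hardware.tolist())
print(lengths.tolist())
```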

Let's do some practice!

1) Create a new `Series` named `students` that displays the `weight` as shown in the expected output.

Expected output:

    Alan       50
    Ben        65
    Charlie    68
    Daniel     75
    Ethan      55
    dtype: int64

In [22]:
students = pd.Series([50, 65, 68, 75, 55], ["Alan", "Ben", "Charlie", "Daniel", "Ethan"])
students

Alan       50
Ben        65
Charlie    68
Daniel     75
Ethan      55
dtype: int64

2) What's the standard deviation of the weight in the series?

Expected output:
    
    10.064790112068906

In [23]:
students.std()

np.float64(10.064790112068906)

3) What's the weight for Charlie?

Expected output:

    68

In [24]:
students["Charlie"]

np.int64(68)

4) What's the median of the weight in the series?

Expected output:

    65.0

In [25]:
students.median()

np.float64(65.0)

## <font color = "blue">Part II - DataFrames Data Structure in Pandas</font>

This part follows the video in the Homework list.

### Pandas DataFrames

Let's talk about `DataFrames` now. <br />
In a Pandas DataFrame, each column is a Pandas `Series`. <br />
And a `DataFrame` can be built from a `dictionary`.

In [26]:
# Create a dictionary, which contains a list (of marks) associated with each key (name)
grades_dict = {'Walley':[87,96,70], 'Eva':[100,87,90], 'Sam':[94,77,90],'Katie':[100,81,82],'Bob':[83,65,85]}

In [27]:
# Create a DataFrame directly from a dictionary structure.
# A DataFrame is a 2-D structure where each column in the DataFrame is a Series.
grades = pd.DataFrame(grades_dict)

In [28]:
# Display the DataFrame 
grades

   Walley  Eva  Sam  Katie  Bob
0      87  100   94    100   83
1      96   87   77     81   65
2      70   90   90     82   85


Note that in a `DataFrame`, each column corresponds to one key/list pair of the `dictionary`. The key becomes the column header, and the elements of the list become the values in that column.

In [29]:
# Give custom indices to the DataFrame: use the index parameter
grades = pd.DataFrame(grades_dict, index = ['Test1','Test2','Test3'])
grades

       Walley  Eva  Sam  Katie  Bob
Test1      87  100   94    100   83
Test2      96   87   77     81   65
Test3      70   90   90     82   85


In [30]:
# The grades of one particular student, for all the tests
grades['Sam']

Test1    94
Test2    77
Test3    90
Name: Sam, dtype: int64
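
The output above is a `Series`, which is what "each column is a `Series`" means in practice. A small sketch that rebuilds the same `grades` DataFrame and inspects its structure (the name `sam` is illustrative):

```python
import pandas as pd

grades_dict = {'Walley': [87, 96, 70], 'Eva': [100, 87, 90], 'Sam': [94, 77, 90],
               'Katie': [100, 81, 82], 'Bob': [83, 65, 85]}
grades = pd.DataFrame(grades_dict, index=['Test1', 'Test2', 'Test3'])

# Selecting one column gives back a Series that shares the row labels.
sam = grades['Sam']

print(type(sam).__name__)    # Series
print(grades.shape)          # (rows, columns)
print(list(grades.columns))  # the dictionary keys, in order
print(sam.mean())            # Series methods work on the column
```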

Note that the square brackets `[ ]` notation with a label, as in `grades['Sam']`, selects a <b>column</b>. It is limited because it <b>doesn't let you pick out specific rows by listing their labels</b> (for example, only `Test1` and `Test3`).<br />
This is where the `loc` and `iloc` indexers come in.
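
To see the limitation concretely, the sketch below rebuilds the `grades` DataFrame and tries both uses of plain square brackets. A list of column names works, while a row label raises a `KeyError` because `Test1` is not a column (the flag `row_lookup_failed` is only there for illustration):

```python
import pandas as pd

grades_dict = {'Walley': [87, 96, 70], 'Eva': [100, 87, 90], 'Sam': [94, 77, 90],
               'Katie': [100, 81, 82], 'Bob': [83, 65, 85]}
grades = pd.DataFrame(grades_dict, index=['Test1', 'Test2', 'Test3'])

# A list of column labels inside the brackets selects several columns.
two_columns = grades[['Sam', 'Eva']]

# A row label inside plain brackets is looked up among the columns and fails.
row_lookup_failed = False
try:
    grades['Test1']
except KeyError:
    row_lookup_failed = True

print(two_columns.shape)
print(row_lookup_failed)
```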

### Using loc indexer to select rows and columns with Pandas DataFrames

The `Pandas` `loc` indexer can be used with `DataFrames` to select rows or columns by <b>label</b>.

In [31]:
# Get the grade of all students for Test2 only
grades.loc['Test2']

Walley    96
Eva       87
Sam       77
Katie     81
Bob       65
Name: Test2, dtype: int64

In [32]:
# Get the grade of all students for Test1 and Test3 only.
# Note that we need here the double square bracket notation because the rows 
# have to be specified in a list.
grades.loc[['Test1','Test3']]

       Walley  Eva  Sam  Katie  Bob
Test1      87  100   94    100   83
Test3      70   90   90     82   85


In [33]:
# Get the grade of Sam, Eva and Bob only and for Test1 and Test3 only.
# Note that we also need the double square bracket notation for the 
# sub-list of rows and the sub-list of columns.
grades.loc[['Test1','Test3'],['Sam', 'Eva','Bob']]

       Sam  Eva  Bob
Test1   94  100   83
Test3   90   90   85
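
Two further `loc` patterns follow from the same row/column form. As a hedged sketch: a single row label with a single column label returns one value, and `:` in the row position keeps all rows (the names `sam_test2` and `sam_and_bob` are illustrative):

```python
import pandas as pd

grades_dict = {'Walley': [87, 96, 70], 'Eva': [100, 87, 90], 'Sam': [94, 77, 90],
               'Katie': [100, 81, 82], 'Bob': [83, 65, 85]}
grades = pd.DataFrame(grades_dict, index=['Test1', 'Test2', 'Test3'])

# One row label and one column label give a single value.
sam_test2 = grades.loc['Test2', 'Sam']

# A bare colon for the rows means "all rows"; only the columns are restricted.
sam_and_bob = grades.loc[:, ['Sam', 'Bob']]

print(sam_test2)
print(sam_and_bob)
```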


### Using iloc indexer to select rows and columns with Pandas DataFrames

The `Pandas` `iloc` indexer can be used with `DataFrames` to select rows or columns by integer <b>position</b> (counting from 0).

In [34]:
# Get the grade of all students for Test2 only,
# which is row 1 (remember first row is row 0)
grades.iloc[1]

Walley    96
Eva       87
Sam       77
Katie     81
Bob       65
Name: Test2, dtype: int64

In [35]:
# Same result as with:
grades.loc['Test2']

Walley    96
Eva       87
Sam       77
Katie     81
Bob       65
Name: Test2, dtype: int64

Note that `iloc[1]` gives the same result as `loc['Test2']`.
- `loc` is used when the row is specified by its <b>label</b> (here a string such as `'Test2'`)
- `iloc` is used when the row is specified by its integer <b>position</b> (counting from 0)
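
The distinction matters most when the labels are themselves integers. The sketch below uses a hypothetical `scores` DataFrame (not part of the lesson) whose row labels are 10, 20 and 30, so that a label and a position can no longer be confused:

```python
import pandas as pd

# Hypothetical data: the row labels are integers, but they are not positions.
scores = pd.DataFrame({'score': [87, 77, 90]}, index=[10, 20, 30])

# loc looks for the row LABELLED 10.
by_label = scores.loc[10]

# iloc looks for the row at POSITION 0 (the first row).
by_position = scores.iloc[0]

# There is no position 10 in a 3-row DataFrame, so iloc raises IndexError.
out_of_bounds = False
try:
    scores.iloc[10]
except IndexError:
    out_of_bounds = True

print(by_label['score'], by_position['score'], out_of_bounds)
```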

In [36]:
# Get the grade of all students for Test1 and Test3 only
# (rows 0 and 2)
grades.iloc[[0,2]]

       Walley  Eva  Sam  Katie  Bob
Test1      87  100   94    100   83
Test3      70   90   90     82   85


In [37]:
# Get the grade of Sam, Eva and Bob only and for Test1 and Test3 only
# (i.e. rows 0 and 2; columns 2, 1 and 4)
grades.iloc[[0,2],[2,1,4]]

       Sam  Eva  Bob
Test1   94  100   83
Test3   90   90   85


### Slicing: Another way to extract data from a DataFrame

Another way to extract certain rows and columns from a DataFrame is to use the slicing or `:` notation. <br />
For example `grades.loc['Test1':'Test3']` extracts grades of all students for `Test1` to `Test3`. <br />
So the slicing notation `:` helps to define a range.

Slicing can be used with either `loc` or `iloc` indexing.

Let's extract the rows labelled `Test1` to `Test3`, using `loc` indexing and the slicing notation `:`.

In [38]:
# Get all rows between Test1 and Test3 inclusive
grades.loc['Test1':'Test3']

       Walley  Eva  Sam  Katie  Bob
Test1      87  100   94    100   83
Test2      96   87   77     81   65
Test3      70   90   90     82   85


Let's do the same with `iloc` indexing. Here the rows at positions 0 to 2 are displayed. As with standard Python slicing, the end position 3 is excluded.

In [39]:
# Get all rows between row 0 inclusive and row 3 exclusive
grades.iloc[0:3]

       Walley  Eva  Sam  Katie  Bob
Test1      87  100   94    100   83
Test2      96   87   77     81   65
Test3      70   90   90     82   85
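
The two slices above return the same rows, but their end points behave differently: a `loc` slice includes its end label, while an `iloc` slice excludes its end position. A short sketch on the same `grades` data, with a step of 2 added as an assumption to show how alternate rows could be selected:

```python
import pandas as pd

grades_dict = {'Walley': [87, 96, 70], 'Eva': [100, 87, 90], 'Sam': [94, 77, 90],
               'Katie': [100, 81, 82], 'Bob': [83, 65, 85]}
grades = pd.DataFrame(grades_dict, index=['Test1', 'Test2', 'Test3'])

# loc slice: the end label 'Test2' is INCLUDED, so two rows come back.
label_slice = grades.loc['Test1':'Test2']

# iloc slice: the end position 1 is EXCLUDED, so only one row comes back.
position_slice = grades.iloc[0:1]

# A step of 2 keeps every other row (positions 0 and 2).
alternate_rows = grades.iloc[::2]

print(len(label_slice), len(position_slice))
print(list(alternate_rows.index))
```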


In [40]:
# Get all rows between Test1 and Test2 inclusive, but only columns Eva and Katie
grades.loc['Test1':'Test2',['Eva','Katie']]

       Eva  Katie
Test1  100    100
Test2   87     81


In [41]:
# You can use iloc indexing to get the same result.
# So this gets rows 0 inclusive to 2 exclusive, and columns 1 and 3.
# Remember that with slicing the end index is excluded.
grades.iloc[0:2,[1,3]]

       Eva  Katie
Test1  100    100
Test2   87     81
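
The learning outcomes also list Boolean indexing for DataFrames. As a minimal sketch on the same `grades` data (the thresholds and the names `sam_above_80`, `good_sam_tests` and `walley_subset` are assumptions chosen for illustration), a comparison on a column produces a Boolean `Series` that can then select rows:

```python
import pandas as pd

grades_dict = {'Walley': [87, 96, 70], 'Eva': [100, 87, 90], 'Sam': [94, 77, 90],
               'Katie': [100, 81, 82], 'Bob': [83, 65, 85]}
grades = pd.DataFrame(grades_dict, index=['Test1', 'Test2', 'Test3'])

# Comparing a column with a value gives one True/False per row.
sam_above_80 = grades['Sam'] > 80

# Passing the mask in square brackets keeps only the rows marked True.
good_sam_tests = grades[sam_above_80]

# A mask can also be combined with a column list inside loc.
walley_subset = grades.loc[grades['Walley'] > 80, ['Walley', 'Katie']]

print(sam_above_80.tolist())
print(list(good_sam_tests.index))
print(walley_subset)
```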


Let's do some practice!

1) Create a new `DataFrame` named `sales_df` that displays the following output.

Expected output:

               Tablets  Handphones  Laptops  Watches
    Singapore     5000        8000     9000     1000
    Malaysia      3500        5000     7000      800
    Thailand      4000        9000     4000     2000

In [42]:
dictionary = {"Tablets": [5000, 3500, 4000], "Handphones": [8000, 5000, 9000], "Laptops": [9000, 7000, 4000], "Watches": [1000, 800, 2000]}
sales_df = pd.DataFrame(dictionary, ['Singapore', "Malaysia", "Thailand"])
sales_df

           Tablets  Handphones  Laptops  Watches
Singapore     5000        8000     9000     1000
Malaysia      3500        5000     7000      800
Thailand      4000        9000     4000     2000


2) Display the `Handphones` and `Watches` sales for `Singapore` and `Thailand`.

Expected output:

               Handphones  Watches
    Singapore        8000     1000
    Thailand         9000     2000

In [52]:
# Equivalent selection by position: sales_df.iloc[[0, 2], [1, 3]]
sales_df.loc[["Singapore", "Thailand"], ["Handphones", "Watches"]]

           Handphones  Watches
Singapore        8000     1000
Thailand         9000     2000


3) What are the average sales for `Singapore`?

Expected output:

    5750.0

In [None]:
sales_df.loc["Singapore"].mean()

np.float64(5750.0)

4) What are the maximum sales for `Laptops`?

Expected output:

    9000

In [70]:
sales_df["Laptops"].max()

np.int64(9000)
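
The last two answers compute one statistic for one row or one column. The same methods can be applied to the whole DataFrame at once. A hedged sketch that rebuilds `sales_df` from the practice above (the names `country_means`, `product_maxima` and `summary` are illustrative):

```python
import pandas as pd

sales = {"Tablets": [5000, 3500, 4000], "Handphones": [8000, 5000, 9000],
         "Laptops": [9000, 7000, 4000], "Watches": [1000, 800, 2000]}
sales_df = pd.DataFrame(sales, index=["Singapore", "Malaysia", "Thailand"])

# axis=1 averages across the columns, giving one mean per country (row).
country_means = sales_df.mean(axis=1)

# With the default axis, max() gives one maximum per product (column).
product_maxima = sales_df.max()

# describe() produces the summary statistics for every column at once.
summary = sales_df.describe()

print(country_means)
print(product_maxima)
print(summary)
```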

#### Happy Coding!!!