# Python Libraries for Data Science

There are many popular Python toolboxes/libraries:
* Numpy
* Scipy
* Pandas
* SciKit-Learn

Visualization libraries:
* Matplotlib
* Seaborn

## NumPy


* introduces objects for multidimensional arrays and matrices, as well as functions for performing advanced mathematical and statistical operations on those objects
* provides vectorized mathematical operations on arrays and matrices, which significantly improves performance compared with explicit Python loops
* many other Python libraries are built on NumPy
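As a minimal sketch of what vectorization means in practice, the example below applies one arithmetic expression to a whole array at once. The temperature readings are made-up sample values, not data from this course.

```python
import numpy as np

# Hypothetical sample data: five temperature readings in Celsius.
celsius = np.array([12.5, 18.0, 21.3, 25.1, 30.0])

# Vectorized arithmetic: the expression is applied to every element,
# without writing an explicit Python loop.
fahrenheit = celsius * 9 / 5 + 32

# Statistical helpers operate on the whole array as well.
mean_c = celsius.mean()
std_c = celsius.std()

print(fahrenheit)
print(mean_c, std_c)
```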


## SciPy

* collection of algorithms for linear algebra, differential equations, numerical integration, optimization, statistics and more
* built on NumPy


## Pandas

* adds data structures and tools designed to work with table-like data (similar to vectors and data frames in R)
* provides tools for data manipulation: reshaping, merging, sorting, slicing, aggregation etc.
* allows handling missing data
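The capabilities listed above can be sketched on a tiny table. The names, groups, and scores below are hypothetical values chosen only to show sorting, aggregation, and missing-data handling.

```python
import numpy as np
import pandas as pd

# Hypothetical table with one missing score.
df = pd.DataFrame({
    "name": ["Ann", "Bob", "Cid", "Dee"],
    "group": ["a", "b", "a", "b"],
    "score": [80.0, np.nan, 65.0, 90.0],
})

# Missing data: detect it, then fill it with the column mean.
missing_mask = df["score"].isna()
filled = df["score"].fillna(df["score"].mean())

# Sorting: rows with a missing score are placed last by default.
sorted_df = df.sort_values("score", ascending=False)

# Aggregation: mean score per group (missing values are skipped).
group_mean = df.groupby("group")["score"].mean()

print(sorted_df)
print(group_mean)
```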

## SciKit-Learn
* provides machine learning algorithms: classification, regression, clustering, model validation etc.
* built on NumPy, SciPy and matplotlib

## matplotlib

* Python 2D plotting library which produces publication-quality figures in a variety of hardcopy formats
* a set of functionalities similar to those of MATLAB
* line plots, scatter plots, bar charts, histograms, pie charts etc.
* relatively low-level; some effort is needed to create advanced visualizations




## Seaborn

* based on matplotlib
* provides a high-level interface for drawing attractive statistical graphics
* similar (in style) to the popular ggplot2 library in R


# Pandas

* Open-source
* High-performance
* Easy-to-use data structures
* Data analysis tools
* Works with tabular data, much like an Excel spreadsheet


## What Pandas can do

* Model the data
* Create data frames
* Series
  * One-dimensional array
  * Similar to NumPy arrays
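To illustrate how a Series resembles a NumPy array, here is a small sketch: arithmetic is applied element by element, NumPy functions accept a Series directly, and the underlying array can be retrieved with `to_numpy()`.

```python
import numpy as np
import pandas as pd

s = pd.Series([4, 7, -5, 3])

# Element-wise arithmetic, as with a NumPy array; the index is preserved.
doubled = s * 2

# NumPy universal functions accept a Series and return a Series.
abs_vals = np.abs(s)

# The underlying data can be retrieved as a plain NumPy array.
arr = s.to_numpy()

print(doubled)
print(abs_vals)
print(arr)
```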


## Pandas - Data Frame

* Data frame
  * Spreadsheet-like structure
* Used to prepare data
  * For data manipulation

## Data Frame data types
![image-20230806153235679](./assets/image-20230806153235679.png)

## Data Frame attributes
Python objects have *attributes* and *methods*. Attributes are read without parentheses, while methods are called with parentheses.
![image-20230806153331618](./assets/image-20230806153331618.png)
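The difference between attributes and methods can be seen in a short sketch. The small frame below is an assumed example; the attributes and methods used (`shape`, `columns`, `dtypes`, `ndim`, `head()`, `describe()`) are standard pandas ones.

```python
import pandas as pd

df = pd.DataFrame({"year": [2000, 2001, 2002], "pop": [1.5, 1.7, 3.6]})

# Attributes: accessed without parentheses.
print(df.shape)    # (number of rows, number of columns)
print(df.columns)  # column labels
print(df.dtypes)   # data type of each column
print(df.ndim)     # number of dimensions

# Methods: called with parentheses, optionally with arguments.
print(df.head(2))     # first two rows
print(df.describe())  # summary statistics of the numeric columns
```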


# Importing the module

Import the pandas and NumPy packages:
```python
import pandas as pd
import numpy as np
```


# Data Structure - Series

**Series** (1d homogeneous array)

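A Series is similar to a one-dimensional NumPy array.

For comparison, a plain Python list can be created as follows:


```python
# A plain Python list: values only, no labelled index
obj = [4, 7, -5, 3]
obj
```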


[4, 7, -5, 3]

A Series in pandas can be created as follows:

```python
obj = pd.Series([4,7,-5,3])
obj
```


0    4
1    7
2   -5
3    3
dtype: int64

Normally the index is added automatically (the index 0-3 is shown in the previous output).
The values and the index can be shown as follows:

```python
obj.values
```



array([ 4,  7, -5,  3])

```python
obj.index
```


RangeIndex(start=0, stop=4, step=1)

A custom index can be provided to label each value:

```python
obj2 = pd.Series([4,7,-5,3],index=['d','b','a','c'])
obj2
```


d    4
b    7
a   -5
c    3
dtype: int64

We can then inspect the values and the index using the `values` and `index` attributes:
```python
obj2.values
```


array([ 4,  7, -5,  3])

```python
obj2.index
```


Index(['d', 'b', 'a', 'c'], dtype='object')

To get specific data, we can select from the Series using a single index label or a list of index labels:

```python
obj2['a']
```


np.int64(-5)

```python
obj2[['b','c']]
```


b    7
c    3
dtype: int64
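Selection by label can be contrasted with selection by position. The sketch below uses the explicit `loc` (label-based) and `iloc` (position-based) indexers on the same Series; note that a label slice includes both endpoints, while a position slice excludes the end.

```python
import pandas as pd

obj2 = pd.Series([4, 7, -5, 3], index=["d", "b", "a", "c"])

# Label-based selection.
by_label = obj2.loc["a"]

# Position-based selection: "a" is the third element (position 2).
by_position = obj2.iloc[2]

# A label slice includes both endpoints: "b", "a" and "c".
label_slice = obj2.loc["b":"c"]

# A position slice excludes the end position: positions 1 and 2 only.
pos_slice = obj2.iloc[1:3]

print(by_label, by_position)
print(label_slice)
print(pos_slice)
```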

We can also convert a dict to a Series; the keys become the index:

```python
sdata = {'Ohio':3500, 'Texas':71000,'Oregon':16000, 'Utah':5000}
obj3 = pd.Series(sdata)
obj3
```


Ohio       3500
Texas     71000
Oregon    16000
Utah       5000
dtype: int64

When the `index` parameter is passed, only the values whose keys match the index are kept, and index labels with no matching key get `NaN`:

```python
states = ['California','Ohio','Oregon','Texas']
obj4 = pd.Series(sdata,index = states)
obj4
```


California        NaN
Ohio           3500.0
Oregon        16000.0
Texas         71000.0
dtype: float64

`NaN` marks an index label for which no value was provided. Because `NaN` is a floating-point value, the dtype changes from `int64` to `float64`.
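Missing values like the one for California can be detected, dropped, or replaced. The sketch below rebuilds the same Series and uses the standard pandas methods `isna()`, `dropna()`, and `fillna()`; filling with 0 is only an illustrative choice.

```python
import pandas as pd

sdata = {"Ohio": 3500, "Texas": 71000, "Oregon": 16000, "Utah": 5000}
states = ["California", "Ohio", "Oregon", "Texas"]
obj4 = pd.Series(sdata, index=states)

# Boolean mask: True where the value is missing.
missing = obj4.isna()

# Drop the entries with missing values.
cleaned = obj4.dropna()

# Replace missing values with a chosen default (0 here, as an illustration).
filled = obj4.fillna(0)

print(missing)
print(cleaned)
print(filled)
```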

We can filter the data by indexing with a boolean condition:
```python
obj4[obj4<20000]
```


Ohio       3500.0
Oregon    16000.0
dtype: float64

To get the index labels of the filtered result, use the `index` attribute:
```python
obj4[obj4 <20000].index
```


Index(['Ohio', 'Oregon'], dtype='object')
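Boolean conditions can also be combined. As a minimal sketch, the example below uses `&` (and) and `|` (or) with each condition wrapped in parentheses; the thresholds are arbitrary values chosen for illustration.

```python
import pandas as pd

obj4 = pd.Series({"Ohio": 3500, "Oregon": 16000, "Texas": 71000})

# Both conditions must hold; each one needs its own parentheses.
between = obj4[(obj4 > 5000) & (obj4 < 20000)]

# At least one condition must hold.
either = obj4[(obj4 < 5000) | (obj4 > 50000)]

print(between)
print(either)
```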

## Task: Data Structure - Series Hands-on
Create a Series of students, using the student ID as the index and the name as the value. The names and student IDs are given below.


![image-20230806154629050](./assets/image-20230806154629050.png)

In [16]:
import pandas as pd

In [20]:
data = {
    660632025: "Tide",
    660632027: "Up",
    660632028: "Art",
    660632030: "Temp",
    660632031: "Tung",
    660632032: "Kheng",
    660632033: "Pu",
    660632034: "Big",
    660632035: "Toey",
    660632036: "Sa",
    660632037: "J",
    660632038: "Pokong",
    660632060: "Chompoo",
    660632062: "Pokpak",
    660632064: "Ning",
    660632065: "Peuam",
    660632066: "Bew",
    660632067: "Aon",
    660632068: "Poom",
    660632071: "Tern",
    660632076: "Fang",
    660632079: "Dem",
    660632277: "Nait"
}

students_series = pd.Series(data)

print(students_series)

660632025       Tide
660632027         Up
660632028        Art
660632030       Temp
660632031       Tung
660632032      Kheng
660632033         Pu
660632034        Big
660632035       Toey
660632036         Sa
660632037          J
660632038     Pokong
660632060    Chompoo
660632062     Pokpak
660632064       Ning
660632065      Peuam
660632066        Bew
660632067        Aon
660632068       Poom
660632071       Tern
660632076       Fang
660632079        Dem
660632277       Nait
dtype: object


# Data Structure - Dataframe

A tabular, spreadsheet-like data structure.

Contains an ordered collection of columns.

Can be thought of as a dict of Series.
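The "dict of Series" view can be made concrete with a short sketch: a DataFrame is built from two Series that share an index, and selecting a column returns a Series again. The row labels here are assumed for illustration.

```python
import pandas as pd

labels = ["one", "two", "three"]
year = pd.Series([2000, 2001, 2002], index=labels)
pop = pd.Series([1.5, 1.7, 3.6], index=labels)

# Each dict key becomes a column name; each Series becomes a column.
frame = pd.DataFrame({"year": year, "pop": pop})

# Selecting a column returns a Series that shares the frame's index.
col = frame["pop"]

print(frame)
print(type(col))
```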

The data frame is used to manipulate data, and the results of data science modules can be read back from the values stored in the DataFrame.
We can create a data frame as follows:

```python
data = {'state':['Ohio','Ohio','Ohio','Nevada','Nevada'],
        'year' :[2000  ,2001  ,2002  ,2001    ,2002],
        'pop'  :[1.5   ,1.7   ,3.6   ,2.4     ,2.9]}
frame = pd.DataFrame(data)
frame
```


| | state | year | pop |
|---|---|---|---|
| 0 | Ohio | 2000 | 1.5 |
| 1 | Ohio | 2001 | 1.7 |
| 2 | Ohio | 2002 | 3.6 |
| 3 | Nevada | 2001 | 2.4 |
| 4 | Nevada | 2002 | 2.9 |


We can also arrange the column order using the `columns` parameter:

```python
frame2 = pd.DataFrame(data, columns=['year','state','pop'])
frame2
```


| | year | state | pop |
|---|---|---|---|
| 0 | 2000 | Ohio | 1.5 |
| 1 | 2001 | Ohio | 1.7 |
| 2 | 2002 | Ohio | 3.6 |
| 3 | 2001 | Nevada | 2.4 |
| 4 | 2002 | Nevada | 2.9 |


The index can be set, and a column that is not in the data is filled with `NaN`:

```python
frame2 = pd.DataFrame(data, columns=['year','state','pop','debt'],
                      index=['one','two','three','four','five'])
frame2
```



| | year | state | pop | debt |
|---|---|---|---|---|
| one | 2000 | Ohio | 1.5 | NaN |
| two | 2001 | Ohio | 1.7 | NaN |
| three | 2002 | Ohio | 3.6 | NaN |
| four | 2001 | Nevada | 2.4 | NaN |
| five | 2002 | Nevada | 2.9 | NaN |


We can extract a column either by its name in square brackets or as an attribute:

```python
frame2['state']
```


one        Ohio
two        Ohio
three      Ohio
four     Nevada
five     Nevada
Name: state, dtype: object

```python
frame2.year
```


one      2000
two      2001
three    2002
four     2001
five     2002
Name: year, dtype: int64

We can get the data of a row by its index label using the `loc` indexer:

```python
frame2.loc['three']
```


year     2002
state    Ohio
pop       3.6
debt      NaN
Name: three, dtype: object
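Row selection with `loc` has a position-based counterpart, `iloc`, and `loc` also accepts a row label together with a column label. The following sketch rebuilds the frame without the `debt` column to keep it short.

```python
import pandas as pd

data = {"state": ["Ohio", "Ohio", "Ohio", "Nevada", "Nevada"],
        "year": [2000, 2001, 2002, 2001, 2002],
        "pop": [1.5, 1.7, 3.6, 2.4, 2.9]}
frame2 = pd.DataFrame(data, index=["one", "two", "three", "four", "five"])

# The same row, selected by label and by integer position.
row = frame2.loc["three"]
same_row = frame2.iloc[2]

# A single cell: row label first, then column label.
cell = frame2.loc["three", "pop"]

# Several rows and columns at once, using lists of labels.
subset = frame2.loc[["one", "five"], ["state", "pop"]]

print(row)
print(cell)
print(subset)
```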

We can assign data to a column using a scalar value, a list, or an array:

```python
frame2['debt'] = 16.5
frame2
```


| | year | state | pop | debt |
|---|---|---|---|---|
| one | 2000 | Ohio | 1.5 | 16.5 |
| two | 2001 | Ohio | 1.7 | 16.5 |
| three | 2002 | Ohio | 3.6 | 16.5 |
| four | 2001 | Nevada | 2.4 | 16.5 |
| five | 2002 | Nevada | 2.9 | 16.5 |


```python
frame2['debt'] = np.arange(5)
frame2
```


| | year | state | pop | debt |
|---|---|---|---|---|
| one | 2000 | Ohio | 1.5 | 0 |
| two | 2001 | Ohio | 1.7 | 1 |
| three | 2002 | Ohio | 3.6 | 2 |
| four | 2001 | Nevada | 2.4 | 3 |
| five | 2002 | Nevada | 2.9 | 4 |


Or we can assign a Series to the column. Values are matched by index label, and rows without a matching label get `NaN`:

```python
val = pd.Series([-1.2,-1.5,-1.7],index=['two','four','five'])
frame2['debt'] = val
frame2
```


| | year | state | pop | debt |
|---|---|---|---|---|
| one | 2000 | Ohio | 1.5 | NaN |
| two | 2001 | Ohio | 1.7 | -1.2 |
| three | 2002 | Ohio | 3.6 | NaN |
| four | 2001 | Nevada | 2.4 | -1.5 |
| five | 2002 | Nevada | 2.9 | -1.7 |
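The difference between assigning a Series and assigning a list can be sketched as follows. A Series is aligned by index label, so labels that do not exist in the frame are ignored, whereas a list must match the number of rows exactly. The label `"ten"` and the value 9.9 are hypothetical, used only to show the alignment.

```python
import pandas as pd

frame = pd.DataFrame({"pop": [1.5, 1.7, 3.6]}, index=["one", "two", "three"])

# "two" matches a row label; "ten" (hypothetical) does not exist in the frame.
val = pd.Series([-1.2, 9.9], index=["two", "ten"])
frame["debt"] = val  # aligned by label: "ten" is ignored, other rows get NaN

# A list is assigned by position, so its length must equal the number of rows.
try:
    frame["bad"] = [1, 2]
    length_error = False
except ValueError:
    length_error = True

print(frame)
print(length_error)
```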


```python
frame.describe()
```


| | year | pop |
|---|---|---|
| count | 5.0 | 5.0 |
| mean | 2001.2 | 2.42 |
| std | 0.83666 | 0.864292 |
| min | 2000.0 | 1.5 |
| 25% | 2001.0 | 1.7 |
| 50% | 2001.0 | 2.4 |
| 75% | 2002.0 | 2.9 |
| max | 2002.0 | 3.6 |


Another way to create a data frame is from a nested dict of dicts. The outer keys become the columns and the inner keys become the row index:

```python
pop = {'Nevada': {2001:2.4,2002:2.9},
       'Ohio'  : {2000:1.5,2001:1.7, 2002: 3.6}}
frame3 = pd.DataFrame(pop)
frame3
```


| | Nevada | Ohio |
|---|---|---|
| 2001 | 2.4 | 1.7 |
| 2002 | 2.9 | 3.6 |
| 2000 | NaN | 1.5 |


`NaN` means the data is not provided.
A pandas DataFrame is rectangular: every column shares the same row labels, so missing entries are filled with `NaN` to complete the table.

We can transpose the result, swapping rows and columns:


```python
frame3.T
```


| | 2001 | 2002 | 2000 |
|---|---|---|---|
| Nevada | 2.4 | 2.9 | NaN |
| Ohio | 1.7 | 3.6 | 1.5 |


## Task: Data Structure - Data Frame

From your previous work,
add a midterm score column and an attendance score column for all the students.
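One possible starting point for this task is sketched below, using only three of the students to keep it short. The column names `midterm` and `attendance` and all score values are hypothetical placeholders, not real results; replace them with the actual scores.

```python
import pandas as pd

# Three entries from the student Series created in the previous task.
students_series = pd.Series({660632025: "Tide", 660632027: "Up", 660632028: "Art"})

# Convert the Series to a DataFrame with a named column.
students = students_series.to_frame(name="name")

# Hypothetical placeholder scores, for illustration only.
students["midterm"] = [25, 30, 28]  # a list is assigned row by row
students["attendance"] = 10         # a scalar is broadcast to every row

print(students)
```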