# Chapter 10 Data Analytics with Numpy and Pandas

# Numpy - numerical python 
Numpy is a Python library of classes and functions for mathematics, such as matrices, statistics, and random numbers, with *fast/efficient* algorithms/methods.

# Pandas - basically Excel for Python
Pandas has a class `DataFrame` that is designed for working with tabular data, i.e. spreadsheets.

From its Wikipedia page, on why it is called Pandas:
> The name is derived from the term **"panel data"**, an econometrics term for data sets that include observations over multiple time periods for the same individuals. Its name is a play on the phrase "Python data analysis" itself. Wes McKinney started building what would become pandas at AQR Capital while he was a researcher there from 2007 to 2010.

### Book *Python for Data Analysis* by the creator of Pandas, Wes McKinney:
### https://wesmckinney.com/book/


### Pandas Github page: https://github.com/pandas-dev/pandas 

Anyone can see how the Pandas library is coded/built, anyone can contribute to it, and it is constantly being developed!

### Pandas may be going out of fashion and may be replaced by a faster library, Polars


## Numpy
### Lists and Numpy Arrays look the same but are completely different objects!

In [2]:
import numpy as np

my_list=[1, 3, 4.5]
my_array=np.array(my_list)

In [4]:
# lists and arrays look roughly the same when printed, although the numpy array has no commas
print(my_list)
print(my_array)

[1, 3, 4.5]
[1.  3.  4.5]


In [5]:
# What class are these objects
print(type(my_list))
print(type(my_array))

<class 'list'>
<class 'numpy.ndarray'>


In [9]:
my_list*2

[1, 3, 4.5, 1, 3, 4.5]

In [10]:
my_array*2

array([2., 6., 9.])
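
The difference above comes from how each type defines `*`: for a list it means repetition, for an array it means elementwise multiplication. A minimal sketch of getting the elementwise result from a plain list, for comparison (the names `doubled_list` and `doubled_array` are illustrative, not from the notebook):

```python
import numpy as np

my_list = [1, 3, 4.5]
my_array = np.array(my_list)

# For a list, * repeats the list, so doubling each element needs an explicit loop.
doubled_list = [2 * x for x in my_list]

# For an array, * multiplies every element.
doubled_array = my_array * 2

print(doubled_list)            # [2, 6, 9.0]
print(doubled_array.tolist())  # [2.0, 6.0, 9.0]
```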

In [12]:
#compare dir(my_list) vs dir(my_array)
dir(my_list)

['__add__',
 '__class__',
 '__class_getitem__',
 '__contains__',
 '__delattr__',
 '__delitem__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getitem__',
 '__getstate__',
 '__gt__',
 '__hash__',
 '__iadd__',
 '__imul__',
 '__init__',
 '__init_subclass__',
 '__iter__',
 '__le__',
 '__len__',
 '__lt__',
 '__mul__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__reversed__',
 '__rmul__',
 '__setattr__',
 '__setitem__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 'append',
 'clear',
 'copy',
 'count',
 'extend',
 'index',
 'insert',
 'pop',
 'remove',
 'reverse',
 'sort']

In [13]:
dir(my_array)  #so many more operations, especially math operations!

['T',
 '__abs__',
 '__add__',
 '__and__',
 '__array__',
 '__array_finalize__',
 '__array_function__',
 '__array_interface__',
 '__array_prepare__',
 '__array_priority__',
 '__array_struct__',
 '__array_ufunc__',
 '__array_wrap__',
 '__bool__',
 '__class__',
 '__class_getitem__',
 '__complex__',
 '__contains__',
 '__copy__',
 '__deepcopy__',
 '__delattr__',
 '__delitem__',
 '__dir__',
 '__divmod__',
 '__dlpack__',
 '__dlpack_device__',
 '__doc__',
 '__eq__',
 '__float__',
 '__floordiv__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getitem__',
 '__getstate__',
 '__gt__',
 '__hash__',
 '__iadd__',
 '__iand__',
 '__ifloordiv__',
 '__ilshift__',
 '__imatmul__',
 '__imod__',
 '__imul__',
 '__index__',
 '__init__',
 '__init_subclass__',
 '__int__',
 '__invert__',
 '__ior__',
 '__ipow__',
 '__irshift__',
 '__isub__',
 '__iter__',
 '__itruediv__',
 '__ixor__',
 '__le__',
 '__len__',
 '__lshift__',
 '__lt__',
 '__matmul__',
 '__mod__',
 '__mul__',
 '__ne__',
 '__neg__',
 '__new__',
 '__o

Most numpy operations are **vectorized**, meaning that, given an array, they act on every element of the array:

In [15]:
print(my_array**2)
print(np.sin(my_array))

[ 1.    9.   20.25]
[ 0.84147098  0.14112001 -0.97753012]


In [18]:
import math
#math.sin is not vectorized:
math.sin(my_array)  #gives error

TypeError: only length-1 arrays can be converted to Python scalars

In [19]:
f=np.vectorize(math.sin) #wraps math.sin so it accepts arrays (a convenience: it still loops in Python, so np.sin is faster)
f(my_array)

array([ 0.84147098,  0.14112001, -0.97753012])
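
As a quick check, the wrapped `math.sin` and the built-in `np.sin` should agree on the same input. This is only a sketch; the name `vectorized_sin` is illustrative:

```python
import math
import numpy as np

my_array = np.array([1, 3, 4.5])

# np.vectorize wraps a scalar function so it can be called on an array.
# It is a convenience wrapper that still loops over the elements in Python.
vectorized_sin = np.vectorize(math.sin)

# np.sin is NumPy's built-in vectorized sine.
same = np.allclose(vectorized_sin(my_array), np.sin(my_array))
print(same)  # True
```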

In [20]:
my_array.mean()

2.8333333333333335

In [21]:
my_list.mean()

AttributeError: 'list' object has no attribute 'mean'
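
A list has no `.mean()` method, but the mean of a list can still be computed. A small sketch of two options, assuming the same `my_list` as above (`mean_np` and `mean_std` are illustrative names):

```python
import statistics
import numpy as np

my_list = [1, 3, 4.5]

# np.mean accepts any array-like input, including a plain list.
mean_np = np.mean(my_list)

# The standard library offers statistics.mean for plain lists.
mean_std = statistics.mean(my_list)

print(mean_np, mean_std)
```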

### Numpy matrices

I like to use the `.reshape` method

In [26]:
A=np.array([1, 2, 3, 4, 5, 6]).reshape(2,3) # 2 rows, 3 columns
print(A)

[[1 2 3]
 [4 5 6]]


In [25]:
B=np.array([1, 2, 3, 4, 5, 6]).reshape(3,2) # 3 rows, 2 columns
print(B)

[[1 2]
 [3 4]
 [5 6]]


In [27]:
B[:, 1] #column 2 (python counting starts at 0)

array([2, 4, 6])

In [29]:
B[1,:] # row 2

array([3, 4])
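
A short sketch building on the matrices above: `.shape` reports the dimensions, `-1` in `reshape` asks NumPy to infer one dimension, and `@` is matrix multiplication. Using `np.arange` to build the values is a choice made here, not something from the cells above:

```python
import numpy as np

A = np.arange(1, 7).reshape(2, 3)   # 2 rows, 3 columns
B = np.arange(1, 7).reshape(3, -1)  # -1 lets NumPy infer the number of columns

print(A.shape, B.shape)  # (2, 3) (3, 2)

# The @ operator is matrix multiplication: (2x3) @ (3x2) gives a 2x2 result.
C = A @ B
print(C.tolist())  # [[22, 28], [49, 64]]
```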

## Pandas

The main data structure or class in Pandas is the `DataFrame`, which is roughly like a 2D matrix with labeled rows and columns.

There is also the `pandas.Series`, which is like a 1D array, such as a column in an Excel spreadsheet.

In [30]:
!pip install pandas



In [31]:
import pandas as pd

A series is a 1D dataframe; think of it as a column. Throughout pandas, `index` refers to rows, and `columns` refers to columns.

In [59]:
ser = pd.Series([1, 2, 3], index=["a", "b", "c"])
ser

a    1
b    2
c    3
dtype: int64
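
A small sketch of working with the series above: values can be looked up by label or by position, and arithmetic is vectorized as with numpy arrays.

```python
import pandas as pd

ser = pd.Series([1, 2, 3], index=["a", "b", "c"])

print(ser["b"])             # 2, looked up by label
print(ser.iloc[0])          # 1, looked up by position
print((ser * 10).tolist())  # [10, 20, 30]
print(list(ser.index))      # ['a', 'b', 'c']
```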

We can create a dataframe from a dictionary, but passing a dictionary of plain numbers (scalars) raises an error unless an index is given:

In [61]:
test_scores={
    'Alice': 95,
    'Bob': 80,
    'Carl':90,
    'Demarco':100,
}

test_scores_dataframe=pd.DataFrame(test_scores)

ValueError: If using all scalar values, you must pass an index

In [35]:
pd.DataFrame?

Init signature:
pd.DataFrame(
    data=None,
    index: 'Axes | None' = None,
    columns: 'Axes | None' = None,
    dtype: 'Dtype | None' = None,
    copy: 'bool | None' = None,
) -> 'None'
Docstring:
Two-dimensional, size-mutable, potentially heterogeneous tabular data.

Data structure also contains labeled axes (rows and columns).
Arithmetic operations align on both row and column labels. Can be
thought of as a dict-like container for Series ob

In [48]:
test_scores_dataframe=pd.DataFrame(test_scores, index=['Exam 1'])
test_scores_dataframe

|  | Alice | Bob | Carl | Demarco |
|---|---|---|---|---|
| Exam 1 | 95 | 80 | 90 | 100 |


Notice above that the *keys* of the dictionary became column names. The word `index` in pandas generally refers to the **row** labels.
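
Two other ways to build a dataframe from the same dictionary, sketched here as alternatives rather than as the notebook's method (the names `df_wide` and `df_long` are illustrative):

```python
import pandas as pd

test_scores = {"Alice": 95, "Bob": 80, "Carl": 90, "Demarco": 100}

# Option 1: wrap each scalar in a list, so every column has exactly one row.
df_wide = pd.DataFrame(
    {name: [score] for name, score in test_scores.items()},
    index=["Exam 1"],
)

# Option 2: build a Series first (keys become row labels), then make it a column.
df_long = pd.Series(test_scores).to_frame("Exam 1")

print(df_wide.shape)  # (1, 4)
print(df_long.shape)  # (4, 1)
```

Option 2 gives the students-as-rows layout directly, without a transpose.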

It is surprisingly hard to find out how to add a row to a pandas dataframe; adding a column is easier.

If we want to switch rows and columns, take the transpose `.T`:

0    20
1    50
2    30
3    40
Name: Exam 2, dtype: int64

In [51]:
test_scores_dataframe.T

|  | Exam 1 |
|---|---|
| Alice | 95 |
| Bob | 80 |
| Carl | 90 |
| Demarco | 100 |


The above operation returns a new object; it does not modify the dataframe *in place*, so the original is unchanged:

In [52]:
test_scores_dataframe

|  | Alice | Bob | Carl | Demarco |
|---|---|---|---|---|
| Exam 1 | 95 | 80 | 90 | 100 |
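
A minimal check that `.T` hands back a new object and leaves the original alone, using a hypothetical two-student dataframe `df`:

```python
import pandas as pd

df = pd.DataFrame({"Alice": [95], "Bob": [80]}, index=["Exam 1"])
transposed = df.T

print(transposed is df)  # False: .T returned a different object
print(df.shape)          # (1, 2): the original is unchanged
print(transposed.shape)  # (2, 1)
```

To keep the transposed version, assign the result to a name.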


In [63]:
test_scores_dataframe=test_scores_dataframe.T

In [64]:
test_scores_dataframe

|  | Exam 1 |
|---|---|
| Alice | 95 |
| Bob | 80 |
| Carl | 90 |
| Demarco | 100 |


Accessing data in a column is easy and follows the dictionary convention: use the column name as the key.

In [68]:
test_scores_dataframe['Exam 1']

Alice       95
Bob         80
Carl        90
Demarco    100
Name: Exam 1, dtype: int64

In [69]:
test_scores_dataframe['Exam 2']=[20, 30, 40, 50]

In [70]:
test_scores_dataframe

|  | Exam 1 | Exam 2 |
|---|---|---|
| Alice | 95 | 20 |
| Bob | 80 | 30 |
| Carl | 90 | 40 |
| Demarco | 100 | 50 |


To access data in a **row**, use `loc` (by label) or `iloc` (by integer position).

In [71]:
test_scores_dataframe.loc['Alice']

Exam 1    95
Exam 2    20
Name: Alice, dtype: int64

In [72]:
test_scores_dataframe.iloc[1]  #row 2 has index number 1

Exam 1    80
Exam 2    30
Name: Bob, dtype: int64

In [76]:
test_scores_dataframe.iloc[0:2]

|  | Exam 1 | Exam 2 |
|---|---|---|
| Alice | 95 | 20 |
| Bob | 80 | 30 |


In `iloc`, the `i` stands for *integer*.

In [77]:
test_scores_dataframe.iloc[0:2, 1:2]

|  | Exam 2 |
|---|---|
| Alice | 20 |
| Bob | 30 |


In [79]:
test_scores_dataframe.loc[['Alice', 'Demarco'], ['Exam 2']]

|  | Exam 2 |
|---|---|
| Alice | 20 |
| Demarco | 50 |
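
One difference worth seeing side by side: slices with `iloc` exclude the end position (like Python lists), while slices with `loc` include the end label. A small sketch that rebuilds the same scores table (`by_position` and `by_label` are illustrative names):

```python
import pandas as pd

df = pd.DataFrame(
    {"Exam 1": [95, 80, 90, 100], "Exam 2": [20, 30, 40, 50]},
    index=["Alice", "Bob", "Carl", "Demarco"],
)

by_position = df.iloc[0:2]         # positions 0 and 1; position 2 is excluded
by_label = df.loc["Alice":"Carl"]  # Alice through Carl; the end label is included

print(list(by_position.index))  # ['Alice', 'Bob']
print(list(by_label.index))     # ['Alice', 'Bob', 'Carl']

# .loc also takes a row label and a column label to pick out one value.
print(df.loc["Carl", "Exam 2"])  # 40
```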


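## axis=0 means operate down the rows, giving one result per column
(because you are varying the index in the 0-position, i.e. the row index)

## axis=1 means operate across the columns, giving one result per row
(because you are varying the index in the 1-position, i.e. the column index)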
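
The same axis convention applies to numpy arrays, where it can be easier to see. A minimal sketch with a 2x3 array (`col_sums` and `row_sums` are illustrative names):

```python
import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6]])

# axis=0 collapses the row index: one sum per column.
col_sums = A.sum(axis=0)

# axis=1 collapses the column index: one sum per row.
row_sums = A.sum(axis=1)

print(col_sums.tolist())  # [5, 7, 9]
print(row_sums.tolist())  # [6, 15]
```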

In [80]:
test_scores_dataframe

|  | Exam 1 | Exam 2 |
|---|---|---|
| Alice | 95 | 20 |
| Bob | 80 | 30 |
| Carl | 90 | 40 |
| Demarco | 100 | 50 |


In [83]:
test_scores_dataframe['Average']=test_scores_dataframe.mean(axis=1)

In [84]:
test_scores_dataframe

|  | Exam 1 | Exam 2 | Average |
|---|---|---|---|
| Alice | 95 | 20 | 57.5 |
| Bob | 80 | 30 | 55.0 |
| Carl | 90 | 40 | 65.0 |
| Demarco | 100 | 50 | 75.0 |


In [85]:
test_scores_dataframe.mean(axis=0)

Exam 1     91.250
Exam 2     35.000
Average    63.125
dtype: float64

### To add a column, assign to `df[<column name>]`

### To add a row, assign to `df.loc[<row name>]`


In [87]:
test_scores_dataframe.loc['Average']=test_scores_dataframe.mean(axis=0)

In [88]:
test_scores_dataframe

|  | Exam 1 | Exam 2 | Average |
|---|---|---|---|
| Alice | 95.0 | 20.0 | 57.5 |
| Bob | 80.0 | 30.0 | 55.0 |
| Carl | 90.0 | 40.0 | 65.0 |
| Demarco | 100.0 | 50.0 | 75.0 |
| Average | 91.25 | 35.0 | 63.125 |
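
Assigning to `.loc` with a new label is one way to add a row; `pd.concat` is another, which builds a new dataframe instead of changing the existing one. A sketch with hypothetical students `Erin` and `Frank`, who are not in the table above:

```python
import pandas as pd

df = pd.DataFrame(
    {"Exam 1": [95, 80], "Exam 2": [20, 30]},
    index=["Alice", "Bob"],
)

# Assigning to a label that does not exist yet appends a row in place.
df.loc["Erin"] = [70, 60]

# pd.concat stacks dataframes and returns a new one; df itself is not modified.
new_row = pd.DataFrame({"Exam 1": [85], "Exam 2": [45]}, index=["Frank"])
combined = pd.concat([df, new_row])

print(list(combined.index))  # ['Alice', 'Bob', 'Erin', 'Frank']
print(len(df))               # 3
```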


In [91]:
test_scores_dataframe.loc['max']=test_scores_dataframe.iloc[0:4].max()

In [92]:
test_scores_dataframe

|  | Exam 1 | Exam 2 | Average |
|---|---|---|---|
| Alice | 95.0 | 20.0 | 57.5 |
| Bob | 80.0 | 30.0 | 55.0 |
| Carl | 90.0 | 40.0 | 65.0 |
| Demarco | 100.0 | 50.0 | 75.0 |
| Average | 91.25 | 35.0 | 63.125 |
| max | 100.0 | 50.0 | 75.0 |


In [93]:
test_scores_dataframe.loc['median']=test_scores_dataframe.iloc[0:4].median()

In [94]:
test_scores_dataframe

|  | Exam 1 | Exam 2 | Average |
|---|---|---|---|
| Alice | 95.0 | 20.0 | 57.5 |
| Bob | 80.0 | 30.0 | 55.0 |
| Carl | 90.0 | 40.0 | 65.0 |
| Demarco | 100.0 | 50.0 | 75.0 |
| Average | 91.25 | 35.0 | 63.125 |
| max | 100.0 | 50.0 | 75.0 |
| median | 92.5 | 35.0 | 61.25 |


In [96]:
test_scores_dataframe.describe()

|  | Exam 1 | Exam 2 | Average |
|---|---|---|---|
| count | 7.0 | 7.0 | 7.0 |
| mean | 92.678571 | 37.142857 | 64.553571 |
| std | 6.862814 | 10.745985 | 7.87859 |
| min | 80.0 | 20.0 | 55.0 |
| 25% | 90.625 | 32.5 | 59.375 |
| 50% | 92.5 | 35.0 | 63.125 |
| 75% | 97.5 | 45.0 | 70.0 |
| max | 100.0 | 50.0 | 75.0 |


In [97]:
test_scores_dataframe.iloc[0:4].describe()

|  | Exam 1 | Exam 2 | Average |
|---|---|---|---|
| count | 4.0 | 4.0 | 4.0 |
| mean | 91.25 | 35.0 | 63.125 |
| std | 8.539126 | 12.909944 | 8.984941 |
| min | 80.0 | 20.0 | 55.0 |
| 25% | 87.5 | 27.5 | 56.875 |
| 50% | 92.5 | 35.0 | 61.25 |
| 75% | 96.25 | 42.5 | 67.5 |
| max | 100.0 | 50.0 | 75.0 |
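
The first `describe()` call counts 7 rows because the `Average`, `max`, and `median` rows were stored in the same dataframe as the student scores. One possible way to avoid slicing with `iloc[0:4]` each time is to keep the raw scores and the summaries in separate dataframes. This is a sketch of that alternative, not the approach used above; the names `scores` and `summary` are illustrative:

```python
import pandas as pd

scores = pd.DataFrame(
    {"Exam 1": [95, 80, 90, 100], "Exam 2": [20, 30, 40, 50]},
    index=["Alice", "Bob", "Carl", "Demarco"],
)

# .agg with a list of function names returns one row per statistic,
# computed column by column, without touching the original dataframe.
summary = scores.agg(["mean", "max", "median"])

print(summary)
print(len(scores))  # 4: only the students, so describe() would count 4 rows
```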



# Exercises, courtesy of ChatGPT

**Me**: I just learned about Pandas dataframes. Give me 10 exercises for practice.

**ChatGPT**: Sure! Here are 10 exercises to practice with pandas dataframes:

1. Create a dataframe from a dictionary and display the first few rows.

2. Load a CSV file into a dataframe using `pd.read_csv('filename')` and display its summary statistics (mean, median, max, min, etc.).
3. Select a specific column from the dataframe and display it.
4. Slice and display rows from the dataframe based on specific conditions (e.g., age > 30).

5. Add a new column to the dataframe based on a calculation using existing columns.
6. Sort the dataframe based on a specific column in ascending order.
7. Group the dataframe based on a column and calculate the mean value for each group.
8. Merge two dataframes based on a common column.
9. Drop rows with missing values from the dataframe.
10. Save the modified dataframe to a new CSV file.
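
To get started, here is a small made-up dataframe to practice on, with a sketch of three of the operations named in the list (filtering on a condition, sorting, and grouping). The data and the names `people`, `over_30`, `by_age`, and `mean_age` are invented for illustration:

```python
import pandas as pd

# Hypothetical practice data.
people = pd.DataFrame({
    "name": ["Ann", "Ben", "Cy", "Dee"],
    "age": [25, 35, 45, 31],
    "city": ["Oslo", "Rome", "Oslo", "Rome"],
})

# Exercise 4: keep only the rows where a condition holds.
over_30 = people[people["age"] > 30]

# Exercise 6: sort by a column (ascending is the default).
by_age = people.sort_values("age")

# Exercise 7: group by a column and take the mean of another column per group.
mean_age = people.groupby("city")["age"].mean()

print(over_30["name"].tolist())  # ['Ben', 'Cy', 'Dee']
print(by_age["name"].tolist())   # ['Ann', 'Dee', 'Ben', 'Cy']
print(mean_age.to_dict())        # {'Oslo': 35.0, 'Rome': 33.0}
```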