# Bridging - Coding with Jupyter Notebook


0. Markdowns, Headings, Comments
1. Built-in modules, third party packages 
2. Arrays, Dictionary and Data frames 
3. Data exploration with pandas and matplotlib
4. Save your work in a readable format


## 1. Built-in Python Modules and Third-party Packages

### 1.1 Built-in modules: math 

You can import built-in Python modules without installing them.

Standard libraries distributed with Python: https://docs.python.org/3.10/library/index.html

In [None]:
import math 
# Documentation https://docs.python.org/3.10/library/math.html
# various functions/variables in math module: ceil, floor, sqrt, pi, log10.
# Note the module.function format

print(math.ceil(1.44))
print(math.floor(10.44))
print(math.sqrt(16))
print(math.pi)
print(math.log10(122))
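The `module.function` format is one of several import styles. A minimal sketch of the alternatives, using only names from the standard `math` module (the variable `area` is introduced here for illustration):

```python
import math                 # access names as math.<name>
from math import sqrt, pi   # import selected names directly
import math as m            # import the module under a shorter alias

# All three styles refer to the same functions and constants.
area = pi * sqrt(16) ** 2   # circle area with radius sqrt(16) = 4.0
print(math.isclose(area, m.pi * 16))
```

The plain `import math` style keeps it obvious where each function comes from, which is why it is used in the cell above.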

### 1.2  Third-party packages
Unlike built-in modules, third-party Python packages (e.g., ``pandas``, ``numpy``, ``matplotlib``) must be installed before you can import them.
- use the ``pip`` command in Jupyter Notebook

In [None]:
# Uncomment the next line to install the packages from inside the notebook
# %pip install pandas numpy matplotlib
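Before installing, it can help to check which packages are already available. The following is a minimal sketch using the standard-library `importlib.util.find_spec`; the `packages` list and `status` dictionary are names chosen for this illustration.

```python
import importlib.util

# Packages used later in this notebook (import names).
packages = ["pandas", "numpy", "matplotlib"]

# find_spec returns None when a top-level package cannot be found.
status = {name: importlib.util.find_spec(name) is not None for name in packages}

for name, installed in status.items():
    message = "installed" if installed else f"missing, try: %pip install {name}"
    print(f"{name}: {message}")
```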

Import ``pandas`` and ``numpy`` first.

In [None]:
import pandas as pd  
# Documentation: https://pandas.pydata.org/pandas-docs/stable/reference/

import numpy as np
# Documentation: https://numpy.org/doc/stable/reference/

## 2. Arrays, Dictionary and Data frames 

### 2.1 Deal with arrays with numpy

In [None]:
arr = np.array([[1,2,6],[4,5,1]])
arr

In [None]:
arr.shape   

In [None]:
print(arr[0])    # print values on the first row
print(arr[0,0])  # print value in first row, first column

In [None]:
print(np.amax(arr))         # max value in the flattened array
print(np.amax(arr,axis=0))  # compare across rows: max value of each column
print(np.amax(arr,axis=1))  # compare across columns: max value of each row

In [None]:
print(np.argmax(arr))           # position of the max value in the flattened array
print(np.argmax(arr, axis =0))  # compare across rows: row position of the max in each column
print(np.argmax(arr, axis =1))  # compare across columns: column position of the max in each row
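The `axis` argument names the dimension that is collapsed, which is easy to mix up. A short worked sketch on the same array; the variable names `col_max`, `row_max`, and `flat_pos` are introduced here for illustration.

```python
import numpy as np

arr = np.array([[1, 2, 6], [4, 5, 1]])   # 2 rows, 3 columns

col_max = arr.max(axis=0)   # collapse the rows: one maximum per column
row_max = arr.max(axis=1)   # collapse the columns: one maximum per row
print(col_max)              # [4 5 6]
print(row_max)              # [6 5]

# argmax on the flattened array returns a single position;
# unravel_index converts it back to (row, column).
flat_pos = np.argmax(arr)
row, col = np.unravel_index(flat_pos, arr.shape)
print(int(flat_pos), (int(row), int(col)))   # 2 (0, 2)
```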

In [None]:
arr + 10 # element-by-element computation

In [None]:
arr**2  
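Element-by-element computation is what distinguishes a numpy array from a plain Python list. A small sketch of the difference, with illustrative variable names:

```python
import numpy as np

values = [1, 2, 6]
arr1d = np.array(values)

doubled_list = values * 2    # list repetition: [1, 2, 6, 1, 2, 6]
doubled_arr = arr1d * 2      # element-wise multiplication: [2, 4, 12]

# Arrays of the same shape are also combined element by element.
summed = arr1d + np.array([10, 20, 30])   # [11, 22, 36]
print(doubled_list, doubled_arr, summed)
```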

### 2.2 Deal with data frames with pandas

Create a data frame from an array.

In [None]:
df = pd.DataFrame(arr, columns=['a', 'b', 'c'], index = ['r1','r2']) 
df

In [None]:
df.columns

In [None]:
df.index

In [None]:
df.shape # check the dimensions: (number of rows, number of columns)

In [None]:
df.iloc[0,0]  # select the value at row 0 and column 0 (.iloc indexes by position)

In [None]:
df.loc['r1','a'] # select the value at row r1 and column a (.loc indexes by label)

In [None]:
df.loc['r1']  # select a row

In [None]:
df['a']   # select one column

In [None]:
df[['b','c']]  # select multiple columns with double square brackets
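One difference between `.iloc` and `.loc` shows up when slicing: position slices exclude the end point, while label slices include it. A minimal sketch on the same two-row data frame (the names `by_position` and `by_label` are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.array([[1, 2, 6], [4, 5, 1]]),
                  columns=['a', 'b', 'c'], index=['r1', 'r2'])

by_position = df.iloc[0, 0:2]       # positions 0 and 1; end position 2 is excluded
by_label = df.loc['r1', 'a':'b']    # labels 'a' through 'b'; end label is included

print(by_position.tolist())  # [1, 2]
print(by_label.tolist())     # [1, 2]
```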

Create a data frame from a dictionary (key: value pairs).

See the [Dictionary Tutorial](https://docs.python.org/3/tutorial/datastructures.html#dictionaries).


In [None]:
car_dict = {
  "brand": ["Ford",'BMW','Volkswagen','Benz','Benz','Benz','Volkswagen','Volkswagen'],
  "electric": [False,True,False,True,True,False,True,False],
  "year": [1964,1980,2000,1990,2011,2000,2000,2022],
  "colors": ["red", "white", "blue",'white','red','blue','white','white'],
  "price":[500,1000,700,1200,100,100,80,130]
}

car_dict

In [None]:
car_dict['brand'] # extract the values associated with a key

In [None]:
car_df = pd.DataFrame(car_dict)  
car_df

In [None]:
car_df.columns  # check variable names (column names)

In [None]:
car_df.describe(include= 'all') # describe each variable (both continuous and categorical)

In [None]:
car_df.set_index('brand', inplace=True)   # set "brand" column as index, replace existing data frame with the new one
car_df

In [None]:
car_df.loc['Ford','price']     # .loc index by labels

In [None]:
car_df.loc[car_df['colors'] == 'red','price']  # select price for cars with red color 

In [None]:
car_df.loc[car_df['colors'] == 'red',['price','year']]   # select price and year for cars with red color 
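The condition inside `.loc` is a Boolean Series, so several conditions can be combined. A minimal sketch on a hypothetical three-row table loosely modelled on `car_df` (the names `cars`, `is_red`, and `is_recent` are illustrative):

```python
import pandas as pd

# Hypothetical data, not the full car_df from the cells above.
cars = pd.DataFrame({
    "brand": ["Ford", "BMW", "Benz"],
    "colors": ["red", "white", "red"],
    "year": [1964, 1980, 2011],
    "price": [500, 1000, 100],
}).set_index("brand")

is_red = cars["colors"] == "red"     # True, False, True
is_recent = cars["year"] > 2000      # False, False, True

# Use & for "and", | for "or"; wrap inline conditions in parentheses.
recent_red = cars.loc[is_red & is_recent, ["price", "year"]]
print(recent_red)
```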

In [None]:
car_df.reset_index(inplace = True)  # reset the index to the default integer positions; "brand" becomes a column again
car_df

Save the data frame ``car_df`` as a CSV file named ``car_df.csv`` in the current working directory (CWD), here ``/Users/jingliu/OneDrive - Hong Kong Baptist University/Bridging_python``. To save into the CWD, the file name alone is enough.

In [None]:
car_df.to_csv('car_df.csv',index = False)  # save the data frame as a csv file in CWD, ignore index

Read the csv file from CWD, using the relative path. 

In [None]:
new = pd.read_csv('car_df.csv')       
new
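Relative paths are resolved against the CWD, which `os.getcwd()` reports. The sketch below checks the CWD and runs a save-and-read round trip in a temporary folder so that no files are left behind; the data frame `small` and the file name `small.csv` are hypothetical.

```python
import os
import tempfile
import pandas as pd

print(os.getcwd())   # the folder that relative paths are resolved against

small = pd.DataFrame({"brand": ["Ford", "BMW"], "price": [500, 1000]})  # hypothetical data

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "small.csv")
    small.to_csv(path, index=False)      # index=False: do not write the row labels
    restored = pd.read_csv(path)

print(restored)
```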

## 3. Data exploration with pandas and matplotlib

Here we have saved a csv file named ``diabetes.csv`` in the ``Data`` folder in CWD. 

- the absolute path to the folder is `/Users/jingliu/OneDrive - Hong Kong Baptist University/Bridging_python/Data`

In [None]:
df = pd.read_csv('Data/diabetes.csv')  ## Read in a csv file.
df.head()

Add a new categorical variable named ``Preg``, indicating whether the person has ever been pregnant (``Pregnancies`` is not 0) or not.

In [None]:
df.loc[df['Pregnancies'] == 0, 'Preg'] = 'No'  # when Pregnancies == 0, 'Preg' is assigned as No
df.loc[df['Pregnancies'] != 0, 'Preg'] = 'Yes'
df.head()
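The same two-category column can be built in one step with `np.where`. This is an alternative sketch, not the approach used in the cell above, and the `demo` data frame holds hypothetical values standing in for the `Pregnancies` column.

```python
import numpy as np
import pandas as pd

# Hypothetical values, not taken from diabetes.csv.
demo = pd.DataFrame({"Pregnancies": [0, 3, 1, 0]})

# np.where(condition, value_if_true, value_if_false) fills the whole column at once.
demo["Preg"] = np.where(demo["Pregnancies"] == 0, "No", "Yes")

print(demo["Preg"].tolist())          # ['No', 'Yes', 'Yes', 'No']
print(demo["Preg"].value_counts())    # count per category
```

The two-line `.loc` version leaves rows that match neither condition as missing values, whereas `np.where` always assigns one of the two labels.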

In [None]:
df.shape

In [None]:
df.describe(include='all')

In [None]:
df['Preg'].value_counts()  # count of unique values in a variable

Visualize the relationship between ``Age`` and ``BMI`` with a scatter plot.

- import ``matplotlib`` first

In [None]:
import matplotlib.pyplot as plt
# Documentation: https://matplotlib.org/stable/

In [None]:
# A simple scatter plot

plt.figure(figsize=(10,6))   # create a new figure; figsize is (width, height) in inches
plt.scatter(df['Age'], df['BMI'], color='lightblue')
plt.xlabel("Age")
plt.ylabel("BMI")
plt.title('BMI and Age');

Plot multiple subplots in one figure:
- a scatter plot shows the relationship between ``BMI`` and ``Age``
- a line plot shows the ``Age`` of each person in this dataset
- a box plot shows the distribution of ``Age`` in the dataset

In [None]:
fig, axes = plt.subplots(1, 3,                # create a figure with three subplots: 1 row, 3 columns
                         figsize=(12, 5),     # figure size
                         sharey = True,       # all subplots share the same y-axis (Age)
                         constrained_layout=True)  # automatically adjusts subplots so they fit in window

axes[0].scatter(df['BMI'], df['Age'], color='yellow')
axes[0].set_xlabel('BMI')
axes[0].set_ylabel('Age')
axes[0].set_title('Age vs. BMI')  # Set title for the sub plot

axes[1].plot(df['Age'], 'g+-')    # a line plot; each cross marks one person's age (y-axis)
axes[1].set_title('Line Plot of Age')

axes[2].boxplot(df['Age'])
axes[2].set_title('Age Distribution');

fig.suptitle('Multiple Plots');   # Add a centered suptitle to the figure 
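To see what `sharey=True` does, the sketch below builds a two-panel figure from synthetic data (randomly generated, not `diabetes.csv`) and changes the y-limits on one panel only. The `Agg` backend line is an assumption for running outside Jupyter and can be omitted inside a notebook.

```python
import matplotlib
matplotlib.use("Agg")            # non-interactive backend; omit inside Jupyter
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)              # synthetic data for illustration
age = rng.integers(21, 80, size=50)
bmi = rng.normal(30, 5, size=50)

fig, axes = plt.subplots(1, 2, figsize=(8, 4), sharey=True, constrained_layout=True)
axes[0].scatter(bmi, age)
axes[0].set_xlabel("BMI")
axes[0].set_ylabel("Age")
axes[1].boxplot(age)
axes[1].set_title("Age Distribution")

# With sharey=True, setting the y-limits on one panel applies to the other as well.
axes[0].set_ylim(0, 100)
print(axes[1].get_ylim())
```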
