*Last revision: Thu 30 Oct 2025 18:46:18 AEDT.*  Initial version.

In [None]:
from IPython.display import display, HTML
display(HTML("<style>.container {width:90% !important;}</style>"))

# Lab 1 Introduction to Python

For this course, you will be applying and implementing some aspects of the lecture materials using Python.  Although Python is a general-purpose programming language and many of you are already familiar with it, its importance for data-oriented applications is due to the range of packages available for tasks such as data analysis, statistics, linear algebra, machine learning, and visualization. The work in these notebooks is intended to give you practical experience with setting up and running these tools on realistic datasets.

There are a few concepts you should have some understanding of to complete this work. The following are some resources that may help:
* <a href="https://docs.python.org/3/tutorial/">Python Tutorial</a>
* <a href="http://rosalind.info/problems/list-view/?location=python-village">Introductory Python Exercises</a>   
* <a href="https://realpython.com/tutorials/data-science/">Data Science Topics in Python</a>
* <a href="https://realpython.com/what-is-pip/">What is pip? (package manager for Python)</a>
* <a href="https://bioconda.github.io/contributor/faqs.html#conda-anaconda-minconda">What is conda/Anaconda? (package manager plus ...)</a>


##  Some practice for Python packages

For the lab work in this course, we will be using the Python language, which you may or may not have used before. If not, it may help you to do some independent study to pick up the basics. However, this is not a Python course - the focus is on applying practical machine learning algorithms, and Python is simply a tool for us to do so.

* <b>NumPy</b> (Numerical Python) is a popular library for creating and processing numerical arrays.

* <b>SciPy</b> is an open-source library for scientific computing that covers the disciplines of mathematics, science and engineering.

* <b>Pandas</b>  is a data storage and analysis library that primarily provides utilities to deal with structured records, normally read from CSV files and stored in data frames or tables.

* <b>Scikit-Learn</b> is a Python library for high performance Machine Learning.

* <b>Matplotlib</b> is a Python plotting library that allows you to make static, animated and interactive plots.

* <b>Seaborn</b>  is a Python data visualization library based on Matplotlib, providing a high-level interface for drawing attractive and informative statistical graphics.

* <b>pip</b>  is a package manager for Python that allows quick installation and management of Python packages.

### Import all the packages that we will learn or practice
Before starting, check that the following packages are installed: numpy, scipy, pandas, scikit-learn, matplotlib and seaborn.

If you are using Linux or macOS, you can run the following command in a terminal to install these packages:

```pip install numpy scipy pandas scikit-learn matplotlib seaborn```

If you are using Anaconda, you can run the following command at the Anaconda prompt to install these packages:

```conda install numpy scipy pandas scikit-learn matplotlib seaborn```
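
One way to check which of these packages are already installed is to ask Python for their versions. The sketch below uses only the standard library (`importlib.metadata`, available from Python 3.8); the helper name `installed_versions` is an assumption made for illustration and is not part of any package.

```python
from importlib import metadata

# Distribution names as they would be passed to pip
packages = ["numpy", "scipy", "pandas", "scikit-learn", "matplotlib", "seaborn"]

def installed_versions(names):
    """Map each distribution name to its installed version, or None if it is missing."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions

for name, version in installed_versions(packages).items():
    print(f"{name}: {version if version else 'NOT INSTALLED'}")
```

Any package reported as not installed can then be added with the `pip` or `conda` command above.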

To edit code in the Jupyter environment, you simply click on the cell where the code should go, ensure ```code``` is selected in the dropdown menu at the top of the notebook, and start typing. 

To run code in the Jupyter environment, you simply click in the code area and press ```shift+enter``` (or ```shift+return``` on Mac).

To edit markdown, simply do the same but ensure ```markdown``` is selected. 

<b>Tip:</b> in a Jupyter notebook, if the code in the cell you are working on depends on code defined in a cell <i>above</i>, that earlier code must have been executed before you try to run your own. You can do this by selecting the "keyboard" icon at the top of the page, which brings up a menu with several options for running code in the notebook. A useful option is ```run all cells above```.

In [None]:
import numpy as np
import scipy
import pandas as pd
import sklearn as sk
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

## NumPy
### Some basic concepts

In [None]:
list_object=[1,2,3,4]                          # Creating a list object
array=np.array(list_object)                    # Converting the original list object into numpy array object
array=np.array(list_object,dtype=np.float32)   # Specifying data type
zeros=np.zeros((4,3))                          # Creating a 4x3 matrix filled with zeros

print("list object: \n")
print(list_object)
print()
print("array object: \n")
print(array)
print()
print("zero matrix: \n")
print(zeros)
print()
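
As a small follow-up to the cell above, the sketch below inspects the attributes that describe an array: its shape, number of dimensions and data type. It recreates the same two arrays so that it runs on its own.

```python
import numpy as np

array = np.array([1, 2, 3, 4], dtype=np.float32)   # 1-D array with an explicit data type
zeros = np.zeros((4, 3))                            # 2-D array; np.zeros defaults to float64

# shape is a tuple of dimension sizes, ndim is the number of dimensions
print(array.shape, array.ndim, array.dtype)
print(zeros.shape, zeros.ndim, zeros.dtype)
```

Checking `shape` and `dtype` is often the quickest way to confirm that an array is what you expect before doing arithmetic with it.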

### Operators
Arithmetic operators

In [None]:
x=np.array([4,6])
y=np.array([2,3])
z = x + y                                # x and y are numpy arrays with the same size
print("z = x + y: ",z)
z = x * y
print("z = x * y: ",z)
z = x / y
print("z = x / y: ",z)
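
Note that the arithmetic operators above act element by element. The sketch below contrasts the element-wise product with the inner (dot) product, and shows a scalar being applied to every entry; the variable names are chosen here only for illustration.

```python
import numpy as np

x = np.array([4, 6])
y = np.array([2, 3])

elementwise = x * y      # multiplies matching entries: [4*2, 6*3]
dot = x @ y              # inner product: 4*2 + 6*3
shifted = x + 1          # the scalar is broadcast to every entry

print("elementwise:", elementwise)
print("dot:", dot)
print("shifted:", shifted)
```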

Comparison operators

In [None]:
print("Hello world")
print("x: ", x)
print("y: ", y)
z = x > y
print("z = x > y:", z)
z = x > 5
print("z = x > 5:", z)
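
The Boolean arrays produced by comparison operators are commonly used as masks. A minimal sketch, reusing the same `x` as above, of selecting, counting and replacing elements with a mask:

```python
import numpy as np

x = np.array([4, 6])

mask = x > 5                       # Boolean array, one entry per element of x
selected = x[mask]                 # keeps only the elements where mask is True
count = int(mask.sum())            # True counts as 1, so the sum is the number of matches
replaced = np.where(mask, x, 0)    # keeps x where mask is True, puts 0 elsewhere

print("mask:", mask)
print("selected:", selected)
print("count:", count)
print("replaced:", replaced)
```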

Unary operators (aggregations such as mean and sum)

In [None]:
A = np.arange(9).reshape((3,3)) 
print("matrix A: \n",A)
mean_a = np.mean(A)                     # calculates mean of all elements
print("mean:", mean_a)
col_sum = A.sum(axis = 0)               # calculates sum of each column
print("col_sum", col_sum)
row_sum = A.sum(axis = 1)               # calculates sum of each row
print("row_sum", row_sum)
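
The `axis` argument works the same way for other aggregations. The sketch below, on the same 3x3 matrix, computes column and row means and uses `keepdims=True` to keep the result two-dimensional.

```python
import numpy as np

A = np.arange(9).reshape((3, 3))            # [[0, 1, 2], [3, 4, 5], [6, 7, 8]]

col_mean = A.mean(axis=0)                   # axis=0 collapses the rows: one value per column
row_mean = A.mean(axis=1)                   # axis=1 collapses the columns: one value per row
row_sum_2d = A.sum(axis=1, keepdims=True)   # same row sums, kept as a 3x1 column

print("col_mean:", col_mean)
print("row_mean:", row_mean)
print("row_sum_2d shape:", row_sum_2d.shape)
```

A useful way to remember this: the axis you pass is the one that disappears from the result.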

## Data Exploration Using Pandas
Start by loading the "diabetes" dataset into a DataFrame, then save the dataset to a CSV file and re-load it from that file.

In [None]:
from sklearn import datasets

# Load the diabetes dataset
diabetes = datasets.load_diabetes()

# Putting the dataset into a Pandas DataFrame
data = pd.DataFrame(diabetes.data, columns=diabetes.feature_names)
target = pd.DataFrame(diabetes.target, columns=["target"])

# Combining the two dataframes into one
df = pd.concat([data,target], axis=1)

# Saving the data frame into "diabetes.csv" file 
df.to_csv("diabetes.csv", index=False) 

#Loading the data from csv file
csv_df = pd.read_csv("diabetes.csv")
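
To see why `index=False` is passed to `to_csv`, the sketch below round-trips a small made-up DataFrame through an in-memory buffer instead of a file. The frame `toy` and the helper `round_trip` are assumptions introduced for illustration only.

```python
import io
import pandas as pd

# Tiny made-up frame, used only to illustrate the round trip
toy = pd.DataFrame({"a": [0.1, -0.25, 0.5], "b": [1.0, 2.0, 3.0]})

def round_trip(frame, index):
    """Write a frame to CSV text in memory, then read it back."""
    buffer = io.StringIO()
    frame.to_csv(buffer, index=index)
    buffer.seek(0)
    return pd.read_csv(buffer)

without_index = round_trip(toy, index=False)
with_index = round_trip(toy, index=True)

# Raises an AssertionError if the reloaded frame differs from the original
pd.testing.assert_frame_equal(toy, without_index)

print(list(without_index.columns))
print(list(with_index.columns))    # the saved row index comes back as an extra column
```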

Analyzing DataFrames

In [None]:
# prints the first few rows of the Dataframe
csv_df.head()

In [None]:
# provides a concise summary of the DataFrame
csv_df.info()

In [None]:
# provides descriptive statistics of central tendency, dispersion and shape
csv_df.describe()

Views and Slicing

In [None]:
bmi = csv_df["bmi"]                       # get the column values for column header 'bmi' 

csv_df["bmi_ex"] = csv_df["bmi"] > 0      # creates a new column with True / False values if bmi > 0

csv_df[csv_df.bmi_ex]                     # selecting rows of entire dataframe where bmi > 0

csv_df[csv_df.bmi_ex][:10]                # selecting first 10 rows where bmi > 0
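
The same selections can be written without adding a helper column, by building the Boolean mask directly and using `.loc` to choose rows and columns in one step. The sketch below uses a small made-up DataFrame (`toy`, hypothetical values) so that it runs on its own.

```python
import pandas as pd

# Made-up values, for illustration only
toy = pd.DataFrame({"bmi": [0.06, -0.05, 0.04, -0.01],
                    "target": [10.0, 20.0, 30.0, 40.0]})

mask = toy["bmi"] > 0                                  # Boolean Series, one entry per row
positive = toy[mask]                                   # rows where bmi > 0, all columns
first_two = toy.loc[mask, ["bmi", "target"]].head(2)   # rows and columns together, first 2 matches

print(positive)
print(first_two)
```

Here `.head(2)` plays the same role as the `[:10]` slice above, just with a different row count.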

## Matplotlib for basic plotting

In [None]:
import matplotlib.pyplot as plt
import numpy as np

N = 100                                              # setting number of points
data_x = np.arange(N)                                # Generate an array with values from 0 to N-1
rdm = (np.random.rand(N)-0.5)                        # rand(N) returns N random numbers between 0 and 1
data_y1 = data_x + rdm*10                            # Linear wrt x, with noise
plt.scatter(data_x, data_y1, color='blue')           # Scatter plot; color parameter is optional
plt.plot(data_x,data_x, "r-")                        # Line plot to show 'true' function without noise in red
plt.show()
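
Because the noise is random, the plot above changes on every run. If a reproducible figure is needed, one option is a seeded generator from `np.random.default_rng`. The sketch below generates the same kind of data without plotting; the helper name `noisy_line` and the seed value are assumptions made for illustration.

```python
import numpy as np

N = 100

def noisy_line(seed):
    """Return x values 0..N-1 and y = x plus uniform noise in [-5, 5)."""
    rng = np.random.default_rng(seed)        # seeded generator: same seed, same numbers
    data_x = np.arange(N)
    noise = (rng.random(N) - 0.5) * 10       # rng.random(N) draws N values in [0, 1)
    return data_x, data_x + noise

x1, y1 = noisy_line(seed=0)
x2, y2 = noisy_line(seed=0)

print("identical across runs:", np.array_equal(y1, y2))
```

The returned arrays can be passed to `plt.scatter` exactly as in the cell above.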

## Reading Datasets
In order to perform any kind of statistics or machine learning we typically need a significant amount of data. By visualizing the data, analysing patterns to understand the data and using algorithms to fit models to the data, we can achieve meaningful results. Scikit-learn makes it easy for us to access some pre-defined 'toy' datasets to practice our understanding.

In this example, we'll use the "diabetes" dataset, which contains records for 442 diabetes patients. The 10 features in the dataset represent each patient's age, sex, body mass index, average blood pressure, and six blood serum measurements. The response of interest is a quantitative measure of disease progression one year after baseline. We'll start by finding the feature that correlates most strongly with this response, a useful first step before fitting a linear regression model to predict a patient's disease progression from their features.

Read through the code below to understand how to begin an analysis of this particular data.

In [None]:
from sklearn import datasets, linear_model

diabetes = datasets.load_diabetes()                                 # load the diabetes dataset
data = pd.DataFrame(diabetes.data, columns=diabetes.feature_names)  # make diabetes dataframe, specify col names
target = pd.DataFrame(diabetes.target, columns=["target"])          # target col is variable we wish to predict from data
df = pd.concat([data,target], axis=1)                               # concatenate data and target into one dataframe

corr = df.corr()                                                    # Calculate the pairwise correlation between all columns
corr_abs=corr.abs()                                                 # get the absolute value of correlation
sns.heatmap(corr_abs,       
            xticklabels=corr.columns.values,
            yticklabels=corr.columns.values)                        # create correlation heat map
plt.show()

corr_array = np.array(corr["target"])[:-1]
corr_abs_array = np.array(corr_abs["target"])[:-1]
i = np.argmax(corr_abs_array)
feature = corr.columns[i]
print('Feature', feature, 'has the largest correlation with target feature, with correlation:', corr_array[i])
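
As a possible next step, the sketch below fits a one-feature linear regression on the feature found above, using scikit-learn's `LinearRegression`. It reloads the data so that it runs on its own, and it scores the model on the same data it was fitted to, so the reported R^2 is only a rough indication of fit rather than a measure of predictive performance.

```python
import pandas as pd
from sklearn import datasets, linear_model

diabetes = datasets.load_diabetes()
data = pd.DataFrame(diabetes.data, columns=diabetes.feature_names)
target = pd.Series(diabetes.target, name="target")

# Pick the feature with the largest absolute correlation with the target
corr_with_target = data.corrwith(target)
feature = corr_with_target.abs().idxmax()

# Fit y = slope * x + intercept using that single feature
model = linear_model.LinearRegression()
model.fit(data[[feature]], target)
r2 = model.score(data[[feature]], target)    # R^2 on the training data

print("feature:", feature)
print("slope:", model.coef_[0], "intercept:", model.intercept_)
print("R^2 on the training data:", r2)
```

For a regression with one feature and an intercept, this R^2 equals the square of the correlation computed above, which ties the two steps together.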

### Conclusion and Further Resources
We have only just scratched the surface with these very powerful modules! Try to get familiar with the features covered here, and for those of you interested in seeing more examples, we strongly advise you to look at the following resource: https://jakevdp.github.io/PythonDataScienceHandbook/
