# COMP8325 Week 2 Workshop

In the Week 2 practical, our goal is to become familiar with the Python libraries we will use throughout the semester to implement concepts from the lectures.

## NumPy

NumPy is one of the fundamental packages for scientific computing in Python. It
contains functionality for multidimensional arrays, high-level mathematical functions
such as linear algebra operations and the Fourier transform, and pseudorandom
number generators.

In Scikit-learn, the NumPy array is the fundamental data structure. Scikit-learn
takes in data in the form of NumPy arrays. Any data you’re using will have to be converted
to a NumPy array. The core functionality of NumPy is the ndarray class, a
multidimensional (n-dimensional) array. All elements of the array must be of the
same type. A NumPy array looks like this:

In [1]:
import numpy as np


In [None]:
# Create an array
x = np.array([[1, 2, 3], [4, 5, 6]])
print("x:\n{}".format(x))
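The properties described above (dimensions and a single element type) can be inspected directly on the array. The following is a minimal sketch; the variable name mixed is only illustrative.

```python
import numpy as np

x = np.array([[1, 2, 3], [4, 5, 6]])

# shape is (rows, columns); ndim is the number of axes
print("shape:", x.shape)   # (2, 3)
print("ndim:", x.ndim)     # 2
print("dtype:", x.dtype)   # an integer type; the exact width depends on the platform

# Mixing integers and floats upcasts every element to a common type (float64 here)
mixed = np.array([1, 2.5, 3])
print("mixed dtype:", mixed.dtype)
print(mixed)
```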

NumPy provides many convenience functions for creating matrices/vectors.

In [None]:
# create an array of all zeros
a = np.zeros((2,2))
print (a)

In [None]:
# create an array of all ones
b = np.ones((1,2))
b

In [None]:
# create a constant array
c = np.full((2,2), 7)
c

In [None]:
# create 2x2 identity matrix
d = np.eye(2)
d

In [None]:
# create an array filled with random values between 0 and 1.
e = np.random.random((2,2))
e

In [None]:
# Creating array with vstack()

v1 = np.array([1, 2, 3])
v2 = np.array([4, 5, 6])
v3 = np.array([7, 8, 9])

M = np.vstack([v1, v2, v3])
print (M)

In [None]:
# Create an array with hstack()

v1 = np.array([1, 2, 3])
v2 = np.array([4, 5, 6])
v3 = np.array([7, 8, 9])

M = np.hstack([v1, v2, v3])
print (M)
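The difference between the two stacking functions is easiest to see from the resulting shapes. The sketch below uses two of the vectors from the cells above; np.column_stack is included as a related function for comparison and is not otherwise used in this workshop.

```python
import numpy as np

v1 = np.array([1, 2, 3])
v2 = np.array([4, 5, 6])

stacked_rows = np.vstack([v1, v2])   # each vector becomes a row
stacked_flat = np.hstack([v1, v2])   # vectors are joined end to end

print(stacked_rows.shape)  # (2, 3)
print(stacked_flat.shape)  # (6,)

# np.column_stack places each 1-D vector in its own column instead
as_columns = np.column_stack([v1, v2])
print(as_columns.shape)    # (3, 2)
```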

### NumPy for Linear Algebra

In [None]:
M = np.array([[1, 2, 3],
             [4, 5, 6],
             [7, 8, 9]])

v = np.array([[1],
              [2],
              [3]])

# Addition of two matrices
print (v + v)

In [None]:
# Scalar multiplication
print(v)

print("\n 3 x v: ")
print(3 * v)

In [None]:
# Matrix-vector product with dot()
M = np.array ([[3, 0, 2], 
              [2, 0, -2], 
              [0, 1, 1]])

v = np.array([[1], 
              [2], 
              [3]])
print (M.dot(v))

In [None]:
# Is this right?
print (v.dot(M))

In [None]:
# Transposing v to shape (1, 3) makes the shapes compatible
print(v.T.dot(M))
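Whether a product is defined depends on the shapes: for A.dot(B), the number of columns of A must equal the number of rows of B. The sketch below repeats the three products from the cells above with the @ operator (equivalent to dot() for 2-D arrays) and catches the error raised by the incompatible one.

```python
import numpy as np

M = np.array([[3, 0, 2],
              [2, 0, -2],
              [0, 1, 1]])
v = np.array([[1],
              [2],
              [3]])

print(M.shape, v.shape)     # (3, 3) (3, 1)

# (3, 3) x (3, 1): inner dimensions match, result has shape (3, 1)
print((M @ v).shape)

# (3, 1) x (3, 3): inner dimensions 1 and 3 differ, so NumPy raises a ValueError
try:
    v @ M
except ValueError as err:
    print("v @ M failed:", err)

# (1, 3) x (3, 3): transposing v fixes the mismatch, result has shape (1, 3)
print(v.T @ M)
```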

## Pandas

Pandas is a Python library for data wrangling and analysis. It is built around a data
structure called the DataFrame that is modeled after the R DataFrame. Simply put, a
Pandas DataFrame is a table, similar to an Excel spreadsheet. Pandas provides a great
range of methods to modify and operate on this table; in particular, it allows SQL-like
queries and joins of tables. In contrast to NumPy, which requires that all entries in an
array be of the same type, Pandas allows each column to have a separate type (for
example, integers, dates, floating-point numbers, and strings). 
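To make the per-column types and the SQL-like operations concrete, here is a minimal sketch on a small hand-made table. All names and values (houses, postcodes, price_m and so on) are hypothetical and are not taken from the workshop dataset.

```python
import pandas as pd

# Hypothetical toy data, not taken from the workshop dataset
houses = pd.DataFrame({
    "suburb": ["Hornsby", "Epping", "Hornsby"],
    "bedrooms": [3, 4, 2],
    "price_m": [1.45, 2.10, 0.98],
    "sold_on": pd.to_datetime(["2023-01-15", "2023-02-03", "2023-02-20"]),
})
postcodes = pd.DataFrame({
    "suburb": ["Hornsby", "Epping"],
    "postcode": [2077, 2121],
})

# Each column keeps its own type
print(houses.dtypes)

# SQL-like filter: WHERE bedrooms >= 3
print(houses[houses["bedrooms"] >= 3])

# SQL-like join: LEFT JOIN postcodes ON suburb
joined = houses.merge(postcodes, on="suburb", how="left")
print(joined)
```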

In [16]:
import pandas as pd

In [None]:
# Load the collected dataset. header=0 takes the column names from the first row; na_values='?' treats '?' as a missing value
property_data = pd.read_csv('property-sold-price-hornsby.csv', header=0, na_filter=True, na_values='?')

property_data.info()

property_data

In [None]:
# Use the describe() method of DataFrame in Pandas.
property_data.describe()

Missing value imputation. If your collected dataset has missing values, you need to handle them, for example by imputation. Refer to https://scikit-learn.org/stable/modules/impute.html#impute for details on how to achieve this in Scikit-learn.

In [None]:
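# Missing value handling
print("Before imputing missing values:")
print(property_data.isnull().sum())

# Perform imputation: median for the parking count, mean for the land area.
# Assigning the result back avoids calling fillna(inplace=True) on a column selection,
# which recent pandas versions warn about.
property_data["# car parking"] = property_data["# car parking"].fillna(property_data["# car parking"].median())
property_data["Land area (m2)"] = property_data["Land area (m2)"].fillna(property_data["Land area (m2)"].mean())

# Check for missing values
print("\nAfter imputing missing values:")
print(property_data.isnull().sum())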
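The same kind of imputation can also be done with Scikit-learn's SimpleImputer, which is covered on the page linked above. This is a minimal sketch on hypothetical toy data (the column names car_parking and land_area are placeholders), not the workshop's prescribed approach.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data with missing entries marked as NaN
df = pd.DataFrame({
    "car_parking": [1.0, 2.0, np.nan, 2.0, 5.0],
    "land_area": [600.0, np.nan, 750.0, 900.0, 650.0],
})

# Replace each missing value with the median of its column
imputer = SimpleImputer(strategy="median")
filled = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)

print(filled)
print(imputer.statistics_)  # the per-column medians learned during fit
```

An imputer fitted this way can later apply the same medians to new data through transform(), which a one-off fillna() call does not offer.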

It can be seen that the property features have different scales. Many machine learning models are sensitive to feature scale, so we scale the features so that each one is treated equally. Refer to https://scikit-learn.org/stable/modules/preprocessing.html#standardization-or-mean-removal-and-variance-scaling for more details about how to achieve feature scaling.

In [None]:
# Use the Min-Max feature scaling to transform the original data
from sklearn.preprocessing import MinMaxScaler

# Select the numerical attributes
attributes = ["# bedrooms", "# bathrooms", "# car parking", "Land area (m2)", "Sold price ($M)"]
scaler = MinMaxScaler()
property_data_num = property_data[attributes]
property_data_num_scaled = pd.DataFrame(scaler.fit_transform(property_data_num), columns=property_data_num.columns)

property_data_num_scaled
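Min-max scaling maps each column to the range [0, 1] using x' = (x - min) / (max - min). The sketch below computes this by hand on hypothetical values and checks that it agrees with MinMaxScaler under its default settings.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical land areas in square metres
X = np.array([[600.0], [750.0], [900.0]])

# Min-max scaling by hand: x' = (x - min) / (max - min), column by column
manual = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

scaled = MinMaxScaler().fit_transform(X)

print(manual.ravel())
print(np.allclose(manual, scaled))
```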

## Matplotlib

Matplotlib is the primary scientific plotting library in Python. It provides functions
for making publication-quality visualizations such as line charts, histograms, scatter
plots, and so on. Visualizing your data and different aspects of your analysis can give
you important insights, and we will be using Matplotlib for all our visualizations.
When working inside the Jupyter Notebook, you can show figures directly in the
browser by using the %matplotlib notebook and %matplotlib inline commands.
We recommend using %matplotlib notebook, which provides an interactive environment.

In [24]:
import matplotlib.pyplot as plt

In [None]:
# Plot the property price for each house
plt.plot(property_data_num_scaled.index, property_data_num_scaled['Sold price ($M)'], 'r', lw=1)
plt.xlabel('Property ID')
plt.xticks(np.arange(0, 20, step=1))
plt.ylabel('Sold price (min-max scaled)')
plt.show()

In [None]:
# Visualize the relationship between two features. What can you see from the figure?
plt.scatter(property_data_num_scaled['Land area (m2)'], property_data_num_scaled['Sold price ($M)'])
plt.xlabel('Land area (min-max scaled)')
plt.ylabel('Sold price (min-max scaled)')
plt.show()

## Tasks

1. Use Pandas to load the dataset "data/Tweets.csv". Check if missing values exist in the data set. For each of the columns having missing values, do we need to remove the whole column or perform missing value imputation? Justify your answers. If we need to perform imputation, please do that.

In [35]:
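# The task statement gives the path as data/Tweets.csv; adjust the path below to wherever your copy is stored
tweets_data = pd.read_csv('tweets.csv', header=0, na_filter=True, na_values='?')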


In [None]:
# Write and execute your code here

2. Calculate the average "airline_sentiment_confidence" for each "airline". Plot a bar chart to show the average "airline_sentiment_confidence" for each "airline".
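As a hint, a per-group average in Pandas is usually computed with groupby(). The sketch below shows the pattern on hypothetical toy data; the column names group and score are placeholders, not columns of the Tweets dataset.

```python
import pandas as pd

# Hypothetical toy data; column names are placeholders
scores = pd.DataFrame({
    "group": ["a", "b", "a", "b", "c"],
    "score": [0.5, 1.0, 1.0, 0.5, 0.25],
})

# Split the rows by group, then average the score within each group
mean_by_group = scores.groupby("group")["score"].mean()
print(mean_by_group)
```

The result is a Series indexed by group, so its index and values can be passed to plt.bar() to draw the chart.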

In [None]:
# Write and execute your code here


3. Use TF-IDF to transform the "text" column into numerical feature vectors. Refer to https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html for details on how to achieve this in Scikit-learn.

In [33]:
text = tweets_data['text']

In [None]:
# Write and execute your code here
