## **Introduction to Data Science Tools**

This lab introduces the data science tools and fundamentals I will be working with. It covers the basic libraries Pandas, NumPy, and Scikit-learn.

Reference: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.paired_euclidean_distances.html

# **A. Pandas**

[__Pandas__](https://pandas.pydata.org/docs/) is a _library_ for the Python language that provides data structures and functions that simplify data processing and analysis.

In [None]:
# import library
import pandas as pd

## _DataFrame_ Object

_DataFrame_ is the main data structure in Pandas. It stores 2-dimensional table data. By default, the rows of a _DataFrame_ are labeled with integers starting from 0. If column names are not given, the columns are also numbered from 0. You can create a _DataFrame_ from a generic collection such as a _list_, or by reading a file (such as CSV, JSON, or XML).
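
As a minimal sketch of building a _DataFrame_ from a plain Python collection (the rows and column names below are made up for illustration and are not the lab dataset):

```python
import pandas as pd

# Hypothetical rows; each inner list is one row of the table
rows = [
    ['vhigh', 2, 'small'],
    ['low', 4, 'big'],
]

# Without column names, pandas numbers the columns 0, 1, 2
df_default = pd.DataFrame(rows)

# With explicit column names
df_named = pd.DataFrame(rows, columns=['buying', 'doors', 'lug_boot'])

print(df_default.columns.tolist())
print(df_named.columns.tolist())
print(df_named.index.tolist())
```

In both cases the rows keep the default integer labels starting from 0.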

In [None]:
# Create a DataFrame
df = pd.read_csv(
    'https://archive.ics.uci.edu/ml/machine-learning-databases/car/car.data', # Path to the dataset file; can be a local or remote path
    header=None # The first line of the dataset is not a header
)
display(df) # Show the DataFrame

Unnamed: 0,0,1,2,3,4,5,6
0,vhigh,vhigh,2,2,small,low,unacc
1,vhigh,vhigh,2,2,small,med,unacc
2,vhigh,vhigh,2,2,small,high,unacc
3,vhigh,vhigh,2,2,med,low,unacc
4,vhigh,vhigh,2,2,med,med,unacc
...,...,...,...,...,...,...,...
1723,low,low,5more,more,med,med,good
1724,low,low,5more,more,med,high,vgood
1725,low,low,5more,more,big,low,unacc
1726,low,low,5more,more,big,med,good


Give the columns of the _DataFrame_ names to make them more meaningful.

In [None]:
df.columns = ['buying', 'maint', 'doors', 'persons', 'lug_boot', 'safety', 'label']
df

Unnamed: 0,buying,maint,doors,persons,lug_boot,safety,label
0,vhigh,vhigh,2,2,small,low,unacc
1,vhigh,vhigh,2,2,small,med,unacc
2,vhigh,vhigh,2,2,small,high,unacc
3,vhigh,vhigh,2,2,med,low,unacc
4,vhigh,vhigh,2,2,med,med,unacc
...,...,...,...,...,...,...,...
1723,low,low,5more,more,med,med,good
1724,low,low,5more,more,med,high,vgood
1725,low,low,5more,more,big,low,unacc
1726,low,low,5more,more,big,med,good


**Data Set "Car Evaluation"**

* `buying`: buying price
* `maint`: price of the maintenance
* `doors`: number of doors
* `persons`: capacity in terms of persons to carry
* `lug_boot`: the size of luggage boot
* `safety`: estimated safety of the car
* `label`: car acceptability (target)

Data set source: https://archive.ics.uci.edu/ml/datasets/car+evaluation

Use the _method_ <code>head()</code> to view only the first 5 rows of the _DataFrame_. Pass a number to <code>head()</code> to change how many rows are displayed. Use <code>tail()</code> to view the last 5 rows.

In [None]:
df.head()

Unnamed: 0,buying,maint,doors,persons,lug_boot,safety,label
0,vhigh,vhigh,2,2,small,low,unacc
1,vhigh,vhigh,2,2,small,med,unacc
2,vhigh,vhigh,2,2,small,high,unacc
3,vhigh,vhigh,2,2,med,low,unacc
4,vhigh,vhigh,2,2,med,med,unacc
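
A small sketch of passing a row count to <code>head()</code> and <code>tail()</code>, using a made-up ten-row frame rather than the car dataset:

```python
import pandas as pd

# Small illustrative frame (not the car dataset)
demo = pd.DataFrame({'value': range(10)})

first_three = demo.head(3)   # rows at positions 0, 1, 2
last_two = demo.tail(2)      # rows at positions 8, 9

print(first_three)
print(last_two)
```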


## Displaying the attributes of the _DataFrame_ object

Returns the dimensions of a _DataFrame_ with the <code>shape</code> attribute

In [None]:
print("Dimension of df: ", df.shape)
print("df consists of {} rows and {} columns.".format(df.shape[0], df.shape[1]))

Dimension of df:  (1728, 7)
df consists of 1728 rows and 7 columns.


Displays the data type for each column with the <code>dtypes</code> attribute

In [None]:
df.dtypes

buying      object
maint       object
doors       object
persons     object
lug_boot    object
safety      object
label       object
dtype: object

Display general statistics for each column with _method_ <code>describe()</code>

In [None]:
df.describe()

Unnamed: 0,buying,maint,doors,persons,lug_boot,safety,label
count,1728,1728,1728,1728,1728,1728,1728
unique,4,4,4,3,3,3,4
top,low,low,5more,more,small,low,unacc
freq,432,432,432,576,576,576,1210


In [None]:
# Statistical description for the desired column
df['buying'].describe()

count     1728
unique       4
top        low
freq       432
Name: buying, dtype: object

Count the number of non-null elements per column with __method__ <code>count()</code>

In [None]:
print("Count for df\n", df.count())

Count for df
 buying      1728
maint       1728
doors       1728
persons     1728
lug_boot    1728
safety      1728
label       1728
dtype: int64


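The counts above show 1728 for every column, which matches the number of rows, so no column contains a value that Pandas recognizes as _null_. This does not guarantee that no data is missing, because missing entries may be encoded as ordinary values (for example a placeholder string). Checking and handling _null_ values is discussed in the subsection "__Handling _missing values___".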
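
To illustrate why a full count does not rule out missing data, here is a hedged sketch with a made-up column in which one missing entry is a real NaN and another is written as the placeholder string `'?'` (the placeholder is an assumption for illustration; the car dataset is not known to use it):

```python
import numpy as np
import pandas as pd

# Hypothetical column: one missing value is NaN, another is written as '?'
demo = pd.DataFrame({'safety': ['low', np.nan, '?', 'high']})

print(len(demo))                 # number of rows
print(demo['safety'].count())    # non-null entries; '?' still counts

# Treat '?' as missing before counting
cleaned = demo.replace('?', np.nan)
print(cleaned['safety'].count())
```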

## Access _DataFrame_ rows and columns

We can access _DataFrame_ columns like accessing indexes on an array or list, either as a Series or a new DataFrame.

In [None]:
# Access column as Series
df_buying = df['buying']
print(type(df_buying))
df_buying

<class 'pandas.core.series.Series'>


0       vhigh
1       vhigh
2       vhigh
3       vhigh
4       vhigh
        ...  
1723      low
1724      low
1725      low
1726      low
1727      low
Name: buying, Length: 1728, dtype: object

In [None]:
# Access column as Dataframe
df_buying_2 = df[['buying']]
print(type(df_buying_2))
df_buying_2

<class 'pandas.core.frame.DataFrame'>


Unnamed: 0,buying
0,vhigh
1,vhigh
2,vhigh
3,vhigh
4,vhigh
...,...
1723,low
1724,low
1725,low
1726,low


In [None]:
# Access more than 1 column as a DataFrame
df_buying_maint = df[['buying', 'maint']]
df_buying_maint

Unnamed: 0,buying,maint
0,vhigh,vhigh
1,vhigh,vhigh
2,vhigh,vhigh
3,vhigh,vhigh
4,vhigh,vhigh
...,...,...
1723,low,low
1724,low,low
1725,low,low
1726,low,low


Access to __exactly one__ column can also be done with the __dot__ (.) operator provided the column has a name (does not use the _default_ number index). Column names must not use attribute names of the _DataFrame_ object. Column names that can be accessed with __dot__ must also not contain characters that are prohibited in Python variable names.

In [None]:
# Access a column with the dot operator
display(df.label)

0       unacc
1       unacc
2       unacc
3       unacc
4       unacc
        ...  
1723     good
1724    vgood
1725    unacc
1726     good
1727    vgood
Name: label, Length: 1728, dtype: object
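
The restrictions on dot access can be sketched with a made-up frame whose column names are chosen to cause trouble (both names are hypothetical):

```python
import pandas as pd

# Hypothetical frame: one column name has a space, another collides with a DataFrame method
demo = pd.DataFrame({
    'lug boot': ['small', 'big'],
    'count': [3, 7],
})

# Bracket access always works
print(demo['lug boot'].tolist())
print(demo['count'].tolist())

# demo.count is the DataFrame method, not the 'count' column
print(callable(demo.count))
```

Bracket access is the safer default when column names are not fully under your control.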

We can access the rows (and columns) of the _DataFrame_ using <code>__loc__</code> or <code>__iloc__</code>. <code>loc</code> selects by row or column label, while <code>iloc</code> selects by integer position. Slices with <code>loc</code> include the end label, whereas slices with <code>iloc</code> exclude the end position.

In [None]:
# Access rows 0 to 3 using loc
df.loc[0:3]  # range inclusive

Unnamed: 0,buying,maint,doors,persons,lug_boot,safety,label
0,vhigh,vhigh,2,2,small,low,unacc
1,vhigh,vhigh,2,2,small,med,unacc
2,vhigh,vhigh,2,2,small,high,unacc
3,vhigh,vhigh,2,2,med,low,unacc


In [None]:
# Access rows 0 to 3 using iloc
df.iloc[0:4] # range exclusive

Unnamed: 0,buying,maint,doors,persons,lug_boot,safety,label
0,vhigh,vhigh,2,2,small,low,unacc
1,vhigh,vhigh,2,2,small,med,unacc
2,vhigh,vhigh,2,2,small,high,unacc
3,vhigh,vhigh,2,2,med,low,unacc


In [None]:
# Access rows 0 - 3, but only the maint and doors columns, using loc
df.loc[0:3, ["maint", "doors"]]

Unnamed: 0,maint,doors
0,vhigh,2
1,vhigh,2
2,vhigh,2
3,vhigh,2


In [None]:
# Access rows 0 - 3, but only the maint and doors columns, using iloc
df.iloc[0:4, 1:3]

Unnamed: 0,maint,doors
0,vhigh,2
1,vhigh,2
2,vhigh,2
3,vhigh,2


In [None]:
# Access a specific cell using loc and iloc
print(df.loc[1,"label"])
print(df.iloc[1,6])

unacc
unacc
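
With the default 0-based index, labels and positions coincide, which hides the difference between <code>loc</code> and <code>iloc</code>. A minimal sketch with a made-up frame whose row labels are 10, 20, 30:

```python
import pandas as pd

# Hypothetical frame whose row labels are not 0, 1, 2, ...
demo = pd.DataFrame(
    {'label': ['unacc', 'good', 'vgood']},
    index=[10, 20, 30],
)

print(demo.loc[10, 'label'])    # row with label 10
print(demo.iloc[0, 0])          # row at position 0 (the same row)

# Label slices include the end label, position slices exclude the end position
print(len(demo.loc[10:20]))
print(len(demo.iloc[0:1]))
```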


Indexing can also be done without using <code>loc</code> or <code>iloc</code>, with the index sequence starting from the column, then the index from the row.

In [None]:
# Indexing without loc or iloc
print(df["label"][0])

unacc


## Handling Missing Values

_Null_ values, or "_missing values_", are entries that are empty or explicitly marked as _null_. This can happen because the information was not provided (even though it exists), is difficult to find, or simply does not exist. _Missing values_ can also be stored as "NA" or "NaN".

In [None]:
# Displays the number of missing values for each column
print("Count missing values for df\n", df.isnull().sum())

Count missing values for df
 buying      0
maint       0
doors       0
persons     0
lug_boot    0
safety      0
label       0
dtype: int64


It can be seen above that in df no _null_ values were found. Next, we try to read the new dataset in the file '<code>MELBOURNE_HOUSE_PRICES_LITE.csv</code>' attached to SCELE.

**Data Set of "Melbourne House Prices"**
* `Suburb`: Suburb
* `Address`: Address
* `Rooms`: Number of rooms
* `Type`:
    <br> br - bedroom(s);
    <br> h - house, cottage, villa, semi, terrace;
    <br> u - unit, duplex;
    <br> t - townhouse;
    <br> dev site - development site;
    <br> o res - other residential;

* `Price`: Price in Australian dollars
* `Method`:
    <br> S - property sold;
    <br> SP - property sold prior;
    <br> PI - property passed in;
    <br> PN - sold prior not disclosed;
    <br> SN - sold not disclosed;
    <br> NB - no bid;
    <br> VB - vendor bid;
    <br> W - withdrawn prior to auction;
    <br> SA - sold after auction;
    <br> SS - sold after auction price not disclosed.
    <br> N/A - price or highest bid not available.

* `SellerG`: Real Estate Agent
* `Date`: Date sold
* `Postcode`: Postcode
* `Regionname`: General Region (West, North West, North, North East, etc.)
* `Propertycount`: Number of properties that exist in the suburb
* `Distance`: Distance from CBD in Kilometres
* `CouncilArea`: Governing council for the area

Source of data set: https://www.kaggle.com/anthonypino/melbourne-housing-market


In [None]:
df2 = pd.read_csv('https://raw.githubusercontent.com/iqrafarhan/KASDD/main/MELBOURNE_HOUSE_PRICES_LITE.csv')

# Displays the number of missing values for each column
print("Count missing values for df2\n", df2.isnull().sum())

Count missing values for df2
 Suburb            0
Address           0
Rooms             0
Type              0
Price            76
Method            0
SellerG           0
Date              0
Postcode          0
Regionname        0
Propertycount     0
Distance          0
CouncilArea       0
dtype: int64


It can be seen above that there is a column "Price" which has a _null_ value. It is marked as "NaN" in the _DataFrame_ if displayed as below.

In [None]:
# Shows DataFrame df2
df2.head()

Unnamed: 0,Suburb,Address,Rooms,Type,Price,Method,SellerG,Date,Postcode,Regionname,Propertycount,Distance,CouncilArea
0,Chintin,24 Porcupine Ct,5,h,,PN,LJ,1/04/2017,3756,Northern Victoria,39,44.2,Macedon Ranges Shire Council
1,Strathmore Heights,5 Avro Ct,3,h,900000.0,PI,Considine,1/04/2017,3041,Western Metropolitan,389,8.2,Moonee Valley City Council
2,Caulfield East,15 Grange Rd,3,h,1400000.0,S,Gary,1/04/2017,3145,Southern Metropolitan,608,8.4,Glen Eira City Council
3,Huntingdale,31 Berkeley St,3,h,1145000.0,S,Ray,1/04/2017,3166,Southern Metropolitan,768,12.3,Monash City Council
4,Essendon West,32 Garnet St,3,h,1430000.0,S,Nelson,1/04/2017,3040,Western Metropolitan,588,7.5,Moonee Valley City Council


There are several approaches to dealing with _missing values_. The most commonly used are:
* Fill in _missing values_ with certain values using _method_ <code>fillna()</code>
* Delete (_drop_) rows or columns that contain _missing values_. Use <code>dropna()</code> to delete __rows__ that contain _missing values_, and <code>dropna(axis='columns')</code> to delete __columns__ that contain missing values.

The examples below will focus on the "Price" column.

In [None]:
# fillna() with the mean value
mean = df2.Price.mean() # get the mean of Price
df2_fill_mean = df2.copy()
df2_fill_mean['Price'] = df2_fill_mean['Price'].fillna(mean) # assign back instead of a chained inplace fill
df2_fill_mean.head(5)

Unnamed: 0,Suburb,Address,Rooms,Type,Price,Method,SellerG,Date,Postcode,Regionname,Propertycount,Distance,CouncilArea
0,Chintin,24 Porcupine Ct,5,h,1032403.0,PN,LJ,1/04/2017,3756,Northern Victoria,39,44.2,Macedon Ranges Shire Council
1,Strathmore Heights,5 Avro Ct,3,h,900000.0,PI,Considine,1/04/2017,3041,Western Metropolitan,389,8.2,Moonee Valley City Council
2,Caulfield East,15 Grange Rd,3,h,1400000.0,S,Gary,1/04/2017,3145,Southern Metropolitan,608,8.4,Glen Eira City Council
3,Huntingdale,31 Berkeley St,3,h,1145000.0,S,Ray,1/04/2017,3166,Southern Metropolitan,768,12.3,Monash City Council
4,Essendon West,32 Garnet St,3,h,1430000.0,S,Nelson,1/04/2017,3040,Western Metropolitan,588,7.5,Moonee Valley City Council


In [None]:
# dropna to delete rows containing missing values
df2_drop_row = df2.dropna()
df2_drop_row.head(5)

Unnamed: 0,Suburb,Address,Rooms,Type,Price,Method,SellerG,Date,Postcode,Regionname,Propertycount,Distance,CouncilArea
1,Strathmore Heights,5 Avro Ct,3,h,900000.0,PI,Considine,1/04/2017,3041,Western Metropolitan,389,8.2,Moonee Valley City Council
2,Caulfield East,15 Grange Rd,3,h,1400000.0,S,Gary,1/04/2017,3145,Southern Metropolitan,608,8.4,Glen Eira City Council
3,Huntingdale,31 Berkeley St,3,h,1145000.0,S,Ray,1/04/2017,3166,Southern Metropolitan,768,12.3,Monash City Council
4,Essendon West,32 Garnet St,3,h,1430000.0,S,Nelson,1/04/2017,3040,Western Metropolitan,588,7.5,Moonee Valley City Council
5,Strathmore Heights,2/1 De Havilland Av,2,h,635500.0,S,Considine,1/07/2017,3041,Western Metropolitan,389,8.2,Moonee Valley City Council
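
Dropping columns instead of rows was mentioned earlier but not demonstrated. A minimal sketch on a made-up miniature of the housing data (the three rows are illustrative only):

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the housing data
demo = pd.DataFrame({
    'Rooms': [5, 3, 3],
    'Price': [np.nan, 900000.0, 1400000.0],
})

dropped_cols = demo.dropna(axis='columns')  # removes the Price column
dropped_rows = demo.dropna()                # removes the first row

print(dropped_cols.columns.tolist())
print(dropped_rows.shape)
```

Dropping a column discards every value in it, so it tends to make sense only when most of the column is missing or the column is not needed.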




## Exercise

In [None]:
# Imports
import pandas as pd

1. Open the dataset <code>MELBOURNE\_HOUSE\_PRICES\_LITE.csv</code> as a _DataFrame_ and display the first __15 rows__.

In [None]:
dfsatu = pd.read_csv('https://raw.githubusercontent.com/iqrafarhan/KASDD/main/MELBOURNE_HOUSE_PRICES_LITE.csv')
dfsatu.head(15)

Unnamed: 0,Suburb,Address,Rooms,Type,Price,Method,SellerG,Date,Postcode,Regionname,Propertycount,Distance,CouncilArea
0,Chintin,24 Porcupine Ct,5,h,,PN,LJ,1/04/2017,3756,Northern Victoria,39,44.2,Macedon Ranges Shire Council
1,Strathmore Heights,5 Avro Ct,3,h,900000.0,PI,Considine,1/04/2017,3041,Western Metropolitan,389,8.2,Moonee Valley City Council
2,Caulfield East,15 Grange Rd,3,h,1400000.0,S,Gary,1/04/2017,3145,Southern Metropolitan,608,8.4,Glen Eira City Council
3,Huntingdale,31 Berkeley St,3,h,1145000.0,S,Ray,1/04/2017,3166,Southern Metropolitan,768,12.3,Monash City Council
4,Essendon West,32 Garnet St,3,h,1430000.0,S,Nelson,1/04/2017,3040,Western Metropolitan,588,7.5,Moonee Valley City Council
5,Strathmore Heights,2/1 De Havilland Av,2,h,635500.0,S,Considine,1/07/2017,3041,Western Metropolitan,389,8.2,Moonee Valley City Council
6,Essendon West,24 Clydebank Rd,4,h,,SP,Barry,1/07/2017,3040,Western Metropolitan,588,7.5,Moonee Valley City Council
7,Essendon West,6 Bourke St,3,h,750000.0,VB,Brad,1/07/2017,3040,Western Metropolitan,588,7.5,Moonee Valley City Council
8,Huntingdale,9/220 Huntingdale Rd,3,h,750000.0,S,Ray,1/07/2017,3166,Southern Metropolitan,768,12.3,Monash City Council
9,Keilor Lodge,8 Turin Pl,4,h,640000.0,PI,Barry,1/07/2017,3038,Western Metropolitan,570,15.5,Brimbank City Council


2. Show __how many__ unique _CouncilArea_ values are in the dataset. Which _CouncilArea_ __appears the most__, and __how many times__ does it appear?

In [None]:
# Answer to no. 2
dfsatu['CouncilArea'].describe()

count                            335
unique                            17
top       Moonee Valley City Council
freq                              84
Name: CouncilArea, dtype: object

From the results above, it can be seen that there are 17 unique CouncilAreas. The council area that appeared the most was Moonee Valley City Council with a total of 84.

3. __Create a new _DataFrame___ containing only the _Regionname_, _Rooms_, and _Price_ columns for the first 50 rows. __Display the _shape_ and first 5 rows__ of the new _DataFrame_.

In [None]:
dfdua = dfsatu.loc[0:49, ['Regionname', 'Rooms', 'Price']]  # loc is end-inclusive, so 0:49 selects 50 rows
display(dfdua.shape)
dfdua.head(5)

(50, 3)

Unnamed: 0,Regionname,Rooms,Price
0,Northern Victoria,5,
1,Western Metropolitan,3,900000.0
2,Southern Metropolitan,3,1400000.0
3,Southern Metropolitan,3,1145000.0
4,Western Metropolitan,3,1430000.0


4. __Display all Regionnames__ of the initial _DataFrame_, __as _Pandas Series___, which have a _Propertycount_ __greater than or equal to 400__. (_Hint_: Please explore how to access _DataFrame_ rows with _boolean_ conditions using <code>loc</code>, <code>iloc</code>, or other means.)

In [None]:
dftiga = dfsatu.loc[dfsatu['Propertycount'] >= 400]
dfempat = dftiga['Regionname']
display(dfempat)

2           Southern Metropolitan
3           Southern Metropolitan
4            Western Metropolitan
6            Western Metropolitan
7            Western Metropolitan
                  ...            
329    South-Eastern Metropolitan
330          Western Metropolitan
331         Southern Metropolitan
332         Southern Metropolitan
334          Western Metropolitan
Name: Regionname, Length: 250, dtype: object
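
The same selection can be written in one step by passing the boolean condition and the column name to <code>loc</code> together. A sketch on a made-up three-row frame (values are illustrative only):

```python
import pandas as pd

# Hypothetical miniature of the housing data
demo = pd.DataFrame({
    'Regionname': ['Northern Victoria', 'Western Metropolitan', 'Southern Metropolitan'],
    'Propertycount': [39, 389, 608],
})

mask = demo['Propertycount'] >= 400          # boolean Series, one value per row
regions = demo.loc[mask, 'Regionname']       # rows where mask is True, one column

print(mask.tolist())
print(type(regions).__name__)
print(regions.tolist())
```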

5. Create a new _DataFrame_ from the initial _DataFrame_ by __deleting all rows__ that have _null_ values. __Display the average price__ of houses in the new _DataFrame_.

In [None]:
dfdrop = dfsatu.dropna()
mean = dfdrop.Price.mean()
mean

1032402.749034749

6. Create a new _DataFrame_ from the initial _DataFrame_ by __filling all null values__ in the _Price_ column with the minimum value. __Display the average price__ of houses in the new _DataFrame_.

In [None]:
min_price = dfsatu.Price.min() # avoid shadowing the built-in min()
df_fill_min = dfsatu.copy()
df_fill_min['Price'] = df_fill_min['Price'].fillna(min_price)
mean1 = df_fill_min.Price.mean()
mean1

857171.080597015

7. Is there a __difference__ between the average prices in numbers 5 and 6? __Why is that?__

Yes. In number 5 the rows with missing values are removed, so the average is taken over only the rows that have a price. In number 6 the 76 missing prices are replaced with the minimum price and stay in the calculation, which pulls the average down (857171.08 versus 1032402.75).
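
The effect is easy to check by hand on a tiny made-up series (the four prices are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical prices with one missing value
prices = pd.Series([100.0, 200.0, 300.0, np.nan])

mean_after_drop = prices.dropna().mean()              # (100 + 200 + 300) / 3
mean_after_fill = prices.fillna(prices.min()).mean()  # (100 + 200 + 300 + 100) / 4

print(mean_after_drop)
print(mean_after_fill)
```

Filling with the minimum adds extra low values to the sum and increases the divisor, so the mean can only stay the same or decrease.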

# **B. NumPy**

[NumPy](https://numpy.org/) (Numerical Python) is a Python library focused on <i>scientific computing</i>. NumPy arrays are similar to Python lists, but they offer several advantages: smaller memory usage, faster runtime, and convenient vector and matrix operations. These advantages have made NumPy one of the most widely used libraries in data analysis.

In [None]:
# import library NumPy
import numpy as np

## NumPy Arrays

To create a new array, we can create a Python list and then convert it into a NumPy array using the <code>array()</code> function in NumPy.

In [None]:
# Creating numpy array
arr = np.array([10,20,30,40,50])
arr

array([10, 20, 30, 40, 50])

There are several built-in functions that can be used to create arrays

In [None]:
# Create an array of 5 zeros
print("np.zeros(5) : ",np.zeros(5))

# Create an array of 5 ones
print("np.ones(5) : ",np.ones(5))

# Create an array of values from 1 up to (but not including) 10 with a step of 2
print("np.arange(1,10,2) : ",np.arange(1,10,2))

# Create an array of 4 evenly spaced values from 1 to 10 (both included)
print("np.linspace(1,10,4) : ",np.linspace(1,10,4))

np.zeros(5) :  [0. 0. 0. 0. 0.]
np.ones(5) :  [1. 1. 1. 1. 1.]
np.arange(1,10,2) :  [1 3 5 7 9]
np.linspace(1,10,4) :  [ 1.  4.  7. 10.]


## Multidimensional Arrays

Apart from 1-dimensional arrays, NumPy also has the ability to create and manipulate multidimensional arrays easily

In [None]:
# Create a multidimensional NumPy array / matrix
mult_arr = np.array([[1,2,3],[4,5,6],[10,20,30]])
mult_arr

array([[ 1,  2,  3],
       [ 4,  5,  6],
       [10, 20, 30]])

## Displaying the Attributes of an Array

There are several attributes that describe an array

In [None]:
print(mult_arr)

print()

# Total number of elements in the array
print("mult_arr.size:", mult_arr.size)

# Number of dimensions (rank) of the array
print("mult_arr.ndim:", mult_arr.ndim)

# Size of each dimension of the array
print("mult_arr.shape:", mult_arr.shape)

[[ 1  2  3]
 [ 4  5  6]
 [10 20 30]]

mult_arr.size: 9
mult_arr.ndim: 2
mult_arr.shape: (3, 3)


## Check Type

We can find out the type of an object using the <code>type()</code> function

In [None]:
# Checking the list type
py_list = [1,2,3,4,5]
print("type(py_list): ")
display(type(py_list))

print()

# Check the type of array
print("type(mult_arr): ")
display(type(mult_arr))

type(py_list): 


list


type(mult_arr): 


numpy.ndarray

To check the data type of the elements in a NumPy array, use the <code>dtype</code> attribute

In [None]:
# Check the data type of the array
mult_arr.dtype

dtype('int64')

## Indexing, Slicing, and Assigning Values to Arrays

In [None]:
arr

array([10, 20, 30, 40, 50])

Access the first element of the array

In [None]:
arr[0]

10

In [None]:
mult_arr

array([[ 1,  2,  3],
       [ 4,  5,  6],
       [10, 20, 30]])

In [None]:
# Access elements of the second row, third column of a multidimensional array
display(mult_arr[1,2])

# Another way
display(mult_arr[1][2])

6

6

We can perform slicing on numpy arrays like Lists

In [None]:
arr

array([10, 20, 30, 40, 50])

In [None]:
# Slice index elements 1 to 3
b = arr[1:4]
b

array([20, 30, 40])
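
One difference from lists is worth knowing: a basic NumPy slice is a view onto the original array, while a list slice is a copy. A minimal sketch (the names `arr_demo` and `py_demo` are introduced here for illustration so the notebook's `arr` is left untouched):

```python
import numpy as np

arr_demo = np.array([10, 20, 30, 40, 50])
py_demo = [10, 20, 30, 40, 50]

view = arr_demo[1:4]     # NumPy slice: a view onto arr_demo
view[0] = 99             # also changes arr_demo[1]

list_slice = py_demo[1:4]  # list slice: an independent copy
list_slice[0] = 99         # py_demo is unchanged

independent = arr_demo[1:4].copy()  # explicit copy when a view is not wanted
independent[1] = -1                 # arr_demo is unchanged by this

print(arr_demo)
print(py_demo)
```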

We can also perform slicing on multidimensional arrays

In [None]:
mult_arr

array([[ 1,  2,  3],
       [ 4,  5,  6],
       [10, 20, 30]])

In [None]:
# Slice the first two columns of the first row
mult_arr[0][0:2]

array([1, 2])

In [None]:
# Second column slice
mult_arr[:,1]

array([ 2,  5, 20])

We can change the value of an array.

In [None]:
# Change the fifth element of array arr to 500
arr[4] = 500
arr

array([ 10,  20,  30,  40, 500])

In [None]:
# Change the element in the first row, first column of mult_arr to 100
mult_arr[0][0] = 100
mult_arr

array([[100,   2,   3],
       [  4,   5,   6],
       [ 10,  20,  30]])

We can use a list to select specific indexes

In [None]:
# Create a list containing indexes
lst = [0,3,4]

In [None]:
# Use a list to select elements
c = arr[lst]
c

array([ 10,  40, 500])

Apart from that, we can also assign elements at indexes in the list with new values

In [None]:
# Assign a new value to the index in the list
arr[lst] = 1234
arr

array([1234,   20,   30, 1234, 1234])

## Operations on Arrays

In [None]:
x = np.array([10, 15, 20, 25, 30])
y = np.array([30, 25, 20, 15, 10])

In [None]:
# Array addition
print("Array addition : ", x + y)

# Array subtraction
print("Array subtraction : ", x - y)

Array addition :  [40 40 40 40 40]
Array subtraction :  [-20 -10   0  10  20]


In [None]:
# Add a constant to every element
print("Addition with a number : ", x + 3)

# Multiply every element by a constant
print("Multiplication with a number : ", x * 2)

Addition with a number :  [13 18 23 28 33]
Multiplication with a number :  [20 30 40 50 60]


In [None]:
m = np.array([1,2])
n = np.array([0,1])

# Element-wise multiplication of 2 arrays
print("Multiplication of arrays m and n : ", m * n)

# Dot product of 2 arrays
print("Dot product of m and n : ", np.dot(m,n))

Multiplication of arrays m and n :  [0 2]
Dot product of m and n :  2


## Operations on Multidimensional Arrays / Matrix

Operations on arrays also apply to multidimensional arrays / matrices. We can also find out the transpose and inverse of a matrix through the following function

In [None]:
P = np.array([[1,2],[3,4]])

# Matrix P
print(P)

# Create a transpose matrix from matrix P
print("Transpose : ")
print(np.transpose(P))

# Create an inverse matrix from matrix P
print("Inverse : ")
print(np.linalg.inv(P))

[[1 2]
 [3 4]]
Transpose : 
[[1 3]
 [2 4]]
Inverse : 
[[-2.   1. ]
 [ 1.5 -0.5]]
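
A quick way to sanity-check an inverse is to multiply it back with the original matrix and compare against the identity. A minimal sketch using the same `P`:

```python
import numpy as np

P = np.array([[1, 2], [3, 4]])
P_inv = np.linalg.inv(P)

# The @ operator is the matrix product; allclose tolerates floating-point rounding
product = P @ P_inv
print(np.allclose(product, np.eye(2)))
```

<code>np.allclose</code> is used instead of <code>==</code> because the product is computed in floating point and may differ from the identity by tiny rounding errors.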


## Statistical Operations on Arrays

In [None]:
stat_arr = np.array([7, 7, 6, 6, 5, 7, 5, 9, 4, 2, 4, 2, 7, 7, 7, 3, 4, 4, 8, 7, 8, 6,
       4, 1, 1])
stat_arr

array([7, 7, 6, 6, 5, 7, 5, 9, 4, 2, 4, 2, 7, 7, 7, 3, 4, 4, 8, 7, 8, 6,
       4, 1, 1])

In [None]:
# The average value of an array
print("Average value : ", np.mean(stat_arr))

# Maximum value of an array
print("Max value : ", np.max(stat_arr))

# Unique value of an array
print("Unique value : ", np.unique(stat_arr))

#Sort the values in the array
print("Sorted : ", np.sort(stat_arr))

Average value :  5.24
Max value :  9
Unique value :  [1 2 3 4 5 6 7 8 9]
Sorted :  [1 1 2 2 3 4 4 4 4 4 5 5 6 6 6 7 7 7 7 7 7 7 8 8 9]


## Exercise

In [None]:
arr_1 = np.array([1,2,3.5,4,5,6,7,8])
arr_1.dtype

dtype('float64')

1. <code>arr_1.dtype</code> is float64 even though most elements were written as integers. A NumPy array holds a single data type, and because one element (3.5) is a float, the integers are converted to floats, which can represent them without loss. The whole array is therefore a float array.
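
A short sketch of this upcasting, using two small made-up arrays:

```python
import numpy as np

ints_only = np.array([1, 2, 3])
mixed = np.array([1, 2, 3.5])

print(ints_only.dtype.kind)   # 'i' for signed integer
print(mixed.dtype)            # float64
print(mixed[0])               # the integer 1 is stored as 1.0
```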

<hr>

In [None]:
V = np.array([[2,3,2],[1,2,1],[3,2,1]])
W = np.array([[1,1,2],[2,3,1],[2,2,2]])

2. Given matrices V and W in the _cell_ above, multiply V and W. Then create a new matrix containing the square root of each element of the product, and display the transpose of that matrix. (Hint: you can use the <code>sqrt</code> function from NumPy)

In [None]:
X = V * W  # element-wise product; the matrix product would be V @ W
Y = np.sqrt(X)
Z = np.transpose(Y)
Z

array([[1.41421356, 1.41421356, 2.44948974],
       [1.73205081, 2.44948974, 2.        ],
       [2.        , 1.        , 1.41421356]])
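
The answer above multiplies V and W element by element. If the exercise is read as asking for the matrix product instead, the <code>@</code> operator is the usual choice. A hedged sketch showing both interpretations side by side (the variable names are introduced here for illustration):

```python
import numpy as np

V = np.array([[2, 3, 2], [1, 2, 1], [3, 2, 1]])
W = np.array([[1, 1, 2], [2, 3, 1], [2, 2, 2]])

elementwise = V * W   # multiplies matching entries
matrix_prod = V @ W   # row-by-column matrix product (same as np.matmul(V, W))

print(elementwise)
print(matrix_prod)

# Square roots of the matrix product, then transposed
roots_T = np.sqrt(matrix_prod).T
print(roots_T)
```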

<hr>

In [None]:
arr_3 = np.array([50, 20, 10, 10, 12, 60, 105,  2, 55, 80, 88, 16, 78, 13, 34, 34, 20,
        9, 51, 88, 19, 13,  3, 51, 45, 76, 41, 23, 34, 36, 55,  8, 55,  3,
        8,  2, 47, 27, 66,  3, 45, 96, 69, 21, 37, 32, 41, 43, 60, 53])

3. Given the array in the _cell_ above, find and display its average, maximum, minimum, median, standard deviation, variance, and size!

In [None]:
ratarata = np.mean(arr_3)
maks = np.max(arr_3)
minimum = np.min(arr_3)
std = np.std(arr_3)
var = np.var(arr_3)
ukuran = np.size(arr_3)
display(ratarata)
display(maks)
display(minimum)
display(std)
display(var)
display(ukuran)

38.94

105

2

27.10897268433461

734.8964

50
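
The question also asks for the median, which the answer cell above does not compute. A minimal sketch using <code>np.median</code> on the same array:

```python
import numpy as np

arr_3 = np.array([50, 20, 10, 10, 12, 60, 105,  2, 55, 80, 88, 16, 78, 13, 34, 34, 20,
        9, 51, 88, 19, 13,  3, 51, 45, 76, 41, 23, 34, 36, 55,  8, 55,  3,
        8,  2, 47, 27, 66,  3, 45, 96, 69, 21, 37, 32, 41, 43, 60, 53])

# With an even number of elements, the median is the mean of the two middle values
median = np.median(arr_3)
print(median)
```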

# **C. Introduction to Scikit-learn**

[Scikit-learn](https://scikit-learn.org/) or **sklearn** is a Python library used for machine learning. This *library* provides various modules that support *supervised* and *unsupervised learning* (we will study this later). Several modules provided are useful for data preprocessing, model fitting, model selection and model evaluation.

## Load Dataset

The [datasets](https://scikit-learn.org/stable/datasets.html#datasets) module from sklearn provides a number of *toy datasets* that can be used to study machine learning. A dataset can be loaded with the <code>load_[dataset name]()</code> function, which returns a dictionary-like <code>Bunch</code> object containing the components of the dataset.

In [None]:
# Accessing dataset "diabetes" from module of sklearn.datasets
from sklearn.datasets import load_diabetes

diabetes = load_diabetes()
print(diabetes.keys()) # View the components that the dataset contains

dict_keys(['data', 'target', 'DESCR', 'feature_names', 'data_filename', 'target_filename'])


We can access a component either with a dictionary-style key or, as in the cell below, as an attribute

In [None]:
# Access the DESCR component to view a description of the dataset
print(diabetes.DESCR)

.. _diabetes_dataset:

Diabetes dataset
----------------

Ten baseline variables, age, sex, body mass index, average blood
pressure, and six blood serum measurements were obtained for each of n =
442 diabetes patients, as well as the response of interest, a
quantitative measure of disease progression one year after baseline.

**Data Set Characteristics:**

  :Number of Instances: 442

  :Number of Attributes: First 10 columns are numeric predictive values

  :Target: Column 11 is a quantitative measure of disease progression one year after baseline

  :Attribute Information:
      - Age
      - Sex
      - Body mass index
      - Average blood pressure
      - S1
      - S2
      - S3
      - S4
      - S5
      - S6

Note: Each of these 10 feature variables have been mean centered and scaled by the standard deviation times `n_samples` (i.e. the sum of squares of each column totals 1).

Source URL:
https://www4.stat.ncsu.edu/~boos/var.select/diabetes.html

For more information see:
Bra
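
As a small sketch, assuming scikit-learn is installed, the two access styles return the same underlying object:

```python
from sklearn.datasets import load_diabetes

diabetes = load_diabetes()

# Dictionary-style and attribute-style access return the same object
data_by_key = diabetes['data']
data_by_attr = diabetes.data

print(data_by_key is data_by_attr)
print(data_by_key.shape)
```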

To use this dataset, we can create a dataframe from the components in the dictionary

In [None]:
# Create a dataframe from dataset components
import pandas as pd

df = pd.DataFrame(diabetes.data, columns=diabetes.feature_names)
df['target'] = diabetes.target
df.head() # look at the first 5 rows

Unnamed: 0,age,sex,bmi,bp,s1,s2,s3,s4,s5,s6,target
0,0.038076,0.05068,0.061696,0.021872,-0.044223,-0.034821,-0.043401,-0.002592,0.019908,-0.017646,151.0
1,-0.001882,-0.044642,-0.051474,-0.026328,-0.008449,-0.019163,0.074412,-0.039493,-0.06833,-0.092204,75.0
2,0.085299,0.05068,0.044451,-0.005671,-0.045599,-0.034194,-0.032356,-0.002592,0.002864,-0.02593,141.0
3,-0.089063,-0.044642,-0.011595,-0.036656,0.012191,0.024991,-0.036038,0.034309,0.022692,-0.009362,206.0
4,0.005383,-0.044642,-0.036385,0.021872,0.003935,0.015596,0.008142,-0.002592,-0.031991,-0.046641,135.0


## Cosine Similarity

One of the functions in the <code>metrics</code> module computes cosine similarity

In [None]:
# Calculates the cosine similarity of 2 vectors
from sklearn.metrics.pairwise import cosine_similarity # import required sklearn functions
import numpy as np

vec_1 = np.array([[15,31,10,9,28]])
vec_2 = np.array([[35,21,6,7,12]])
print(cosine_similarity(vec_1, vec_2))

[[0.80982829]]
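
Cosine similarity is the dot product of two vectors divided by the product of their lengths. The result above can be reproduced by hand with NumPy alone, here using 1-dimensional versions of the same vectors:

```python
import numpy as np

vec_1 = np.array([15, 31, 10, 9, 28])
vec_2 = np.array([35, 21, 6, 7, 12])

# cos(theta) = (a . b) / (||a|| * ||b||)
cos_sim = np.dot(vec_1, vec_2) / (np.linalg.norm(vec_1) * np.linalg.norm(vec_2))
print(cos_sim)
```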


## Some Other Scikit-learn Modules

We will use scikit-learn modules a lot when creating _machine learning_ models. Here are some commonly used modules:

| Module Name | Uses |
| ----------- | --------- |
| sklearn.cluster | apply clustering algorithms |
| sklearn.covariance | estimate feature covariance |
| sklearn.datasets | load popular datasets provided by scikit-learn |
| sklearn.decomposition | apply matrix decomposition algorithms such as PCA |
| sklearn.ensemble | apply ensemble methods, which combine multiple models, for classification and regression |
| sklearn.feature_extraction | extract features from raw data |
| sklearn.feature_selection | apply feature selection algorithms |
| sklearn.linear_model | apply several kinds of linear models |
| sklearn.metrics | measure model performance with various metrics |
| sklearn.model_selection | perform cross-validation |
| sklearn.naive_bayes | apply the Naive Bayes algorithm for classification |
| sklearn.neighbors | apply the KNN algorithm |
| sklearn.pipeline | chain multiple model-building steps in one pipeline |
| sklearn.preprocessing | perform label encoding, scaling, normalization, and other preprocessing |
| sklearn.svm | apply the Support Vector Machine algorithm |
| sklearn.tree | apply the Decision Tree algorithm |


More details can be seen in [scikit-learn module documentation](https://scikit-learn.org/stable/modules/classes.html).

## Exercise

In [None]:
X = [[1, 2], [2, 1]]
Y = [[0, 2], [4, 3]]

1. There are 2 matrices, X and Y, as above. Calculate the Euclidean distance and Manhattan distance between the rows of the two matrices using the scikit-learn module!

In [None]:
from sklearn.metrics.pairwise import euclidean_distances, manhattan_distances

test1 = euclidean_distances(X, Y)
test2 = manhattan_distances(X, Y)
print('Euclidean Distance:', test1)
print('Manhattan Distance:', test2)


Euclidean Distance: [[1.         3.16227766]
 [2.23606798 2.82842712]]
Manhattan Distance: [[1. 4.]
 [3. 4.]]
