<a href="https://colab.research.google.com/github/pbjorda27/IT6203/blob/master/intro_to_colab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

First, we import the **pandas** library, which is needed for importing and processing data. **pd** is an ***alias*** that lets us refer to the pandas library later without repeating its full name.

In [None]:
import pandas as pd

To read data from a comma-delimited file, use **pandas.read_csv()** (written as pd.read_csv() with our alias). Make sure data.csv is already uploaded to the session before running the cell.

We will store the **dataframe** in a **variable** called ***students***.

In [None]:
students = pd.read_csv('data.csv')

After creating a variable, we can refer to it by name to get its contents. In this case, the content of students is the dataframe loaded from data.csv.

In [None]:
students

Unnamed: 0,student_id,first_name,last_name,major,study_time,GPA
0,202005537,Eunice,Ehmann,,7.0,3.5
1,202008560,Hobert,Schoenberger,,5.0,3.2
2,202004948,Nicholas,Sizer,,6.8,2.4
3,202001207,Elvin,Foulks,,4.6,3.1
4,202000260,Bruno,Viney,,7.3,3.6
5,202003083,Alan,Borg,,6.8,3.5
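
An `Unnamed: 0` column like the one above usually means the CSV file has a first column with no header, which often happens when a dataframe was saved together with its row index. The sketch below illustrates this on a small made-up CSV (the text in `csv_text` is hypothetical, not the real data.csv) and shows how the `index_col` parameter of `read_csv` can absorb that column. Whether this applies to data.csv is an assumption.

```python
import io
import pandas as pd

# Hypothetical two-row CSV whose first column has an empty header,
# as happens when a dataframe is saved together with its index
csv_text = ",student_id,GPA\n0,202005537,3.5\n1,202008560,3.2\n"

# Default behavior: the unnamed first column shows up as 'Unnamed: 0'
with_extra = pd.read_csv(io.StringIO(csv_text))
print(list(with_extra.columns))      # ['Unnamed: 0', 'student_id', 'GPA']

# index_col=0 tells pandas to use that first column as the row index instead
without_extra = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(list(without_extra.columns))   # ['student_id', 'GPA']
```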


Variables in Python can store anything, from numbers, strings, and lists to dataframes and machine learning models.

By the way, you can have multiple statements in a code cell. When the cell runs, all statements execute in the order you wrote them.

In [None]:
a_number = 10
another_number = 20
a_string = 'hello world'
a_list = [10,20,30,50]

Numeric variables can be combined in math operations. The available operators are:
- \+ adding
- \- subtracting
- \* multiplying
- / dividing
- \*\* power
- // dividing, then rounding down to a whole number (floor division)
- % dividing then keeping only remainder (modulo)
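
As a quick illustration, the sketch below applies each operator to the same pair of numbers. The names `x` and `y` are made up for this example and are not used elsewhere in the notebook.

```python
# Hypothetical example values, chosen only to illustrate the operators
x = 7
y = 2

print(x + y)    # adding: 9
print(x - y)    # subtracting: 5
print(x * y)    # multiplying: 14
print(x / y)    # dividing: 3.5
print(x ** y)   # power: 49
print(x // y)   # floor division: 3
print(x % y)    # remainder (modulo): 1
```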





In [None]:
a_number + another_number

30

In [None]:
a_number // 3

3

In [None]:
a_number % 3

1

Strings in Python represent text. They can have any length and contain any characters.

In [None]:
a_string

'hello world'
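
A few built-in operations make this more concrete. The sketch below is only illustrative; `greeting` is a made-up name so that `a_string` is left unchanged.

```python
# Hypothetical string used only for this illustration
greeting = 'hello world'

print(len(greeting))      # number of characters, including the space: 11
print(greeting.upper())   # HELLO WORLD
print(greeting + '!')     # joining two strings: hello world!
```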

Lists in Python are ordered collections of items.

In [None]:
a_list

[10, 20, 30, 50]

We can access an item in a list by its index, which is the position of the item in the list. Indexes start at 0 for the first item and end at the list's length minus 1 for the last item.

In [None]:
a_list[1]

20
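
To make the index range concrete, here is a small sketch that reads the first and last items. It redefines `a_list` with the same values so it can run on its own; `first_item` and `last_item` are names introduced just for this example.

```python
# Same values as a_list in the notebook, redefined so the example runs on its own
a_list = [10, 20, 30, 50]

first_item = a_list[0]                # index 0 is the first item
last_item = a_list[len(a_list) - 1]   # the last valid index is length - 1

print(first_item)    # 10
print(last_item)     # 50
print(a_list[-1])    # negative indexes count from the end, so this is also 50
```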

Let's get back to data analytics. We will train our first model to predict GPA from study time. For common machine learning models, we will use Scikit-learn, or **sklearn**.

We will try a linear regression model. For now, this is still an example of working with variables; we will discuss the model in detail later on.

The cell below performs the following steps in order:
1. Import LinearRegression from the linear_model module of sklearn.
2. Create a new model and store it in the lr_model variable.
3. Train it on the student data by calling ***fit()*** from lr_model.

In [None]:
from sklearn.linear_model import LinearRegression

lr_model = LinearRegression()
lr_model.fit(students.loc[:,['study_time']], students.loc[:,['GPA']])

After training, we can access the various learned parameters of the model through the variable that stores it, in this case, lr_model.

In this case, the model will be an equation in the form

$GPA = a + b*study\_time$

a can be obtained from the **intercept_** attribute of the model, and b from **coef_**.

In [None]:
lr_model.intercept_, lr_model.coef_

(array([2.70655988]), array([[0.08161709]]))

So the trained model is

$GPA = 2.707 + 0.082*study\_time$

Now we will try predicting the GPAs of new students. Make sure test_data.csv is uploaded. The cell below then loads the test data into the students_test variable.

In [None]:
students_test = pd.read_csv('test_data.csv')
students_test

Unnamed: 0,student_id,first_name,last_name,major,study_time
0,202005527,Gabrielle,Davis,IT,6.0
1,202820250,Helen,Wilson,CS,5.6
2,202004768,Ian,Moore,SWE,7.4
3,202002209,Jack,Thomas,IT,3.4
4,202000310,Kayla,Harris,CS,4.1


We make predictions with a trained model by calling ***predict()***. We will store the results in students_test as a new column, GPA_predict.

In [None]:
students_test['GPA_predict'] = lr_model.predict(students_test.loc[:,['study_time']])

We can now view the updated dataframe.

In [None]:
students_test

Unnamed: 0,student_id,first_name,last_name,major,study_time,GPA_predict
0,202005527,Gabrielle,Davis,IT,6.0,3.196262
1,202820250,Helen,Wilson,CS,5.6,3.163616
2,202004768,Ian,Moore,SWE,7.4,3.310526
3,202002209,Jack,Thomas,IT,3.4,2.984058
4,202000310,Kayla,Harris,CS,4.1,3.04119
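
These predictions can be reproduced by hand from the model equation. The sketch below plugs the intercept and coefficient printed earlier into $a + b*study\_time$ for the first test student. The variable names `intercept`, `slope`, `study_time`, and `gpa_by_hand` are introduced only for this illustration, and the numbers are copied from the outputs above rather than read from lr_model.

```python
# Values copied from the lr_model.intercept_ and lr_model.coef_ output earlier
intercept = 2.70655988
slope = 0.08161709

# study_time of the first test student (Gabrielle Davis)
study_time = 6.0

# Apply the linear equation GPA = a + b * study_time
gpa_by_hand = intercept + slope * study_time
print(round(gpa_by_hand, 6))   # 3.196262
```

The result agrees with the GPA_predict value in the first row of the table.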


Results can be saved to a new CSV file by calling ***to_csv()*** from the dataframe variable. You may need to close and reopen the folder tab to see the new file.

In [None]:
students_test.to_csv('predicted_data.csv', index=False)

### **IMPORTANT**: user files are **always** removed after your session. So, always remember to save/download your notebooks and any output files after a session.

One way to make this process less inconvenient is to connect the session to your Google Drive. You can then access and save your notebooks and results faster.

First, import drive from google.colab and mount your drive. You will be prompted to allow Google Colab to connect.

To test the code below, create a folder "IT7103 Module 2" in your Google Drive and upload the colab_and_drive notebook to that folder. After running the cell, refresh your Files tab and verify that you can see the folder and the notebook in the current session.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


To change the save location to your Drive folder, add '/content/drive/MyDrive/IT7103 Module 2/' to the beginning of the file path. Verify that the new file appears in your Drive folder.

In [None]:
students_test.to_csv('/content/drive/MyDrive/IT7103 Module 2/predicted_data.csv', index=False)
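
If you save several files to the same folder, it may be convenient to keep the folder path in a variable and build each file path from it. The sketch below shows one way to do that with the standard `os.path.join` function; `drive_folder`, `file_name`, and `save_path` are names introduced for this illustration.

```python
import os

# Folder path taken from the text above; the file name matches the earlier to_csv call
drive_folder = '/content/drive/MyDrive/IT7103 Module 2/'
file_name = 'predicted_data.csv'

# os.path.join inserts a separator only when one is missing
save_path = os.path.join(drive_folder, file_name)
print(save_path)
```

The resulting `save_path` could then be passed to ***to_csv()*** in place of the full string.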

Finally, save this notebook, then close the session with Runtime -> Disconnect and delete runtime.

Then, move on to the colab_and_drive notebook for some final testing.