# Introduction to Data Science PC Lab 01: Introduction to Python

Author: Jan Verwaeren - Arne Deloose

Course: Introduction to Data Science

Welcome, dear reader, to this short introduction to Python and Google Colab. In this notebook, we will cover the basics of Python, which you will need in your journey through data science. Each of the three sections of this notebook first illustrates the principles and then offers several exercises. But before we start, let us cover the basics of notebooks.

**What is a notebook?**

A notebook is a document that contains both text/visuals and code. This makes the code much more human-readable and easier to document. A notebook consists of individual cells which are either text cells such as the one you are reading right now, or code cells. The code cells will contain Python code (but other programming languages can also work with notebooks). If you want to put text inside a code cell, you will need to use a '#' to mark it as a comment. In Colab, you can insert a new cell using the insert menu at the top (try inserting a code cell and a text cell below here).

Well done. By clicking inside a cell, you can alter its contents. Executing a cell can be done using SHIFT + ENTER or by pressing the play button. Deleting a cell can be done with the icon on the right.

**Colab vs Jupyter/VScode**

In this course, we will use Colab, but you can also use a local program such as Jupyter or VSCode. However, some methods will be slightly different, because Colab cannot access local files on your PC. In order to deal with this, the code block below will detect what you are using right now and adjust accordingly. Don't worry too much about how this code works just yet.

In [1]:
try:
    import google.colab
    in_colab = True #set in_colab to True if you're running the code from google colab
except ImportError: #the google.colab module only exists inside Colab
    in_colab = False #otherwise, it's false

**Saving notebooks**

Because Colab is a server-based program, changes you make to this notebook will not be saved locally. If you want to save this notebook with your own notes included, you will need to download it via the File menu: choose Download and select .ipynb. Saved notebooks can be uploaded to Colab again later. The notebooks we use will always be available on Github. Simply paste the Github link into Colab to load a notebook.

Now that we have dealt with this, let's get started with:

## 1. Datatypes

There are various datatypes in Python. Below, we will discuss some of the most important ones: numeric types (integers and floats), strings, lists, tuples and bools. Python has a flexible type system which allows variables to be implicitly defined and converted when necessary to perform certain functions.

**ILLUSTRATION**

**1.a. numeric variables**

Let us start by defining a simple integer and printing out its type using the function type().

As the output shows, explicitly defining the type is not necessary: Python will automatically assume this is an integer based on its form.

In [2]:
#define a variable
a = 5
#print out the type
print(type(a))

<class 'int'>


If we want a float instead, we either have to write the variable in the form of a float (implicit), or explicitly give it the float type.

In [3]:
#define a float

#implicit
b = 5.0
print(type(b))

#explicit
b = float(5)
print(type(b))

<class 'float'>
<class 'float'>


As you can see, Python is very flexible with data types. Notice what happens if we add our float *b* and our int *a* together:

In [4]:
#add an int and a float together
c = a + b
#a will be automatically converted to a float
print(type(c))
print(c) #print the value

<class 'float'>
10.0


This makes Python code easy to write. However, do be careful, because your datatypes might change without Python telling you.

We can also explicitly convert data types, as can be seen below.

In [5]:
#convert types
d = 5.9 #float

#conversion
d1 = int(d) #convert to int
print(d1) #check value

#rounding
d2 = round(d)
print(d2)

5
6


Notice here that rounding and converting are not the same thing: int() simply drops the decimals, while round() goes to the nearest integer. Always be careful when you convert data types.
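To make the difference concrete, here is a small sketch with values chosen purely for illustration. It shows that int() moves towards zero (also for negative numbers), and that round() in Python 3 resolves exact halves towards the nearest even integer.

```python
# int() truncates towards zero; round() goes to the nearest integer
truncated_pos = int(5.9)    # 5
truncated_neg = int(-5.9)   # -5, not -6
rounded_pos = round(5.9)    # 6
rounded_neg = round(-5.9)   # -6

# exact halves go to the nearest even integer in Python 3
half_a = round(2.5)         # 2
half_b = round(3.5)         # 4

print(truncated_pos, truncated_neg, rounded_pos, rounded_neg, half_a, half_b)
```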

Now let us move on to strings

**1.b. strings**

Once again, we can define a string implicitly or explicitly.

In [6]:
str1 = 'hello'
print(type(str1))

str2 = str(5)
print(type(str2))
print(str2)

<class 'str'>
<class 'str'>
5


Be careful here. 5 and '5' are not the same thing. If you try to add *str2* together with *a*, you will get an error because Python cannot convert the variables into a compatible form (go ahead and try this below if you do not believe me).
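A minimal sketch of what happens, and of two ways to resolve it. The variable names mirror the ones used above; which conversion is right depends on whether you want numeric addition or string concatenation.

```python
a = 5          # an int
str2 = str(5)  # the string '5'

try:
    result = a + str2           # int + str is not defined
except TypeError as err:
    print('Error:', err)

# convert explicitly to choose what '+' should mean
total = a + int(str2)           # numeric addition
joined = str(a) + str2          # string concatenation
print(total)
print(joined)
```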

Strings are sequences of characters, which means it is possible to select specific letters. Note that Python indexing starts at 0, so 1:3 selects the second and third letters of a string (the endpoint is not included).

In [7]:
#indexing
#print out letters 2 and 3
print(str1[1:3]) #indexing starts at 0, we want positions 1 and 2 

el
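A few more slicing variations may help to see the pattern. This is only a sketch on an example word; leaving out the start or the end of a slice means "from the beginning" or "up to the end", and negative indices count from the end of the string.

```python
word = 'hello'

first = word[0]      # 'h'   (index 0 is the first letter)
middle = word[1:3]   # 'el'  (positions 1 and 2, endpoint 3 excluded)
start = word[:2]     # 'he'  (from the beginning up to position 2)
rest = word[2:]      # 'llo' (from position 2 to the end)
last = word[-1]      # 'o'   (negative indices count from the end)

print(first, middle, start, rest, last)
```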


We can add ('concatenate') strings together using + . Notice that we are overwriting str2 with a different value here

In [8]:
#add strings together
str2 = 'hello' + ' ' + 'world'
print(str2)

hello world


Strings have methods. These are built-in functions you can use on any string. They use the format:

str.method()

Here, str is replaced by the name of your string and method by the name of the method. Methods can take inputs too. For example:

In [9]:
#string methods
#count how often the letter 'l' appears
result = str2.count('l')
print(result)

3


Or to convert to uppercase:

In [10]:
#convert to uppercase
str3 = str2.upper() #mind the brackets
print(str3)

HELLO WORLD


As you can see, a method can be given with empty input. However, the brackets are still necessary, otherwise you will get the function back instead of the output of the function.
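The difference can be checked directly. In this short sketch (the variable names are made up for illustration), leaving out the brackets stores the method itself, while adding them stores its result.

```python
greeting = 'hello world'

without_brackets = greeting.upper   # the method itself, not yet executed
with_brackets = greeting.upper()    # the result of calling the method

print(callable(without_brackets))   # True: this is still a function
print(with_brackets)                # HELLO WORLD
```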

Below, you can see an overview of some important string methods

<div align="center">String methods overview</div>

![alt text](files_IDS/stringmethods.png "String methods")

**1.c. lists and tuples**

If we want to have multiple elements in one variable, we can use a list or a tuple. Tuples and lists behave identically for the most part; however, tuples are immutable (elements cannot be replaced). Once again, we can define them implicitly or explicitly. Converting other variables is also possible, but be careful here, because Python might not always convert the way you expect.

In [11]:
#lists
list1 = list(str3)
list2 = ['apples', 'bananas', 'coconuts']
print(list1)
print(list2)

#tuples
tuple1 = tuple(str3)
tuple2 = ('apples', 'bananas', 'coconuts')
print(tuple1)
print(tuple2)

['H', 'E', 'L', 'L', 'O', ' ', 'W', 'O', 'R', 'L', 'D']
['apples', 'bananas', 'coconuts']
('H', 'E', 'L', 'L', 'O', ' ', 'W', 'O', 'R', 'L', 'D')
('apples', 'bananas', 'coconuts')


As you can see, converting a string will split it into letters by default. 

We can access elements of a list using indexing with square brackets [ ]

In [12]:
#indexing
print(list2[1]) #access a word
print(tuple2[1])

bananas
bananas


If an element of a list is a string, we can double index to access letters within this string

In [13]:
#double indexing
print(list2[1][2:]) #second word, select letters from 3 to the end of the string

nanas


Modifying a list can be done by assigning a new element to a position. We can also use list methods such as *append* to add an element.

In [14]:
#replace element
list2[1] = 'oranges'
print(list2)

['apples', 'oranges', 'coconuts']


As you can see below, *append* is an 'in-place' method. It directly modifies the list without giving an output

In [15]:
#list methods
#append

list2.append('pears') 
print(list2)

['apples', 'oranges', 'coconuts', 'pears']
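Because *append* works in place, its return value is None. The sketch below, using a made-up list, shows why writing `fruits = fruits.append(...)` is a common mistake: it would replace the list by None.

```python
fruits = ['apples', 'oranges']

returned = fruits.append('pears')   # the list itself is modified
print(fruits)                       # ['apples', 'oranges', 'pears']
print(returned)                     # None

# so do NOT write: fruits = fruits.append('kiwis')
```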


**Note**: Tuples are immutable, so assigning or appending will not work
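As an illustration of this note, the sketch below tries to modify a tuple and catches the resulting error. It also shows one possible workaround (an assumption about what you might want, not the only option): convert to a list, modify it, and convert back.

```python
fruit_tuple = ('apples', 'bananas', 'coconuts')

try:
    fruit_tuple[1] = 'oranges'          # assignment is not allowed
except TypeError as err:
    assignment_error = str(err)
    print('Error:', assignment_error)

has_append = hasattr(fruit_tuple, 'append')   # tuples have no append method
print(has_append)

# workaround: go through a list and build a new tuple
as_list = list(fruit_tuple)
as_list[1] = 'oranges'
new_tuple = tuple(as_list)
print(new_tuple)
```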

The number of elements in a list can be found with the function len() (this will also work on a string)

In [16]:
#number of elements in a list
print(len(list2))

4


List elements can be the same type, but we can mix types as well

In [17]:
#mixing var types
list4 = [5, 'apple', 4.20]
print(list4)

[5, 'apple', 4.2]


Finally, the string method split can be used to separate a string into list elements. A separator or 'delimiter' can be provided; if it is left out, the string is split on whitespace.

In [18]:
#split a string into a list
str4 = 'this,is,a,test,string'
print(str4.split(',')) #split into list using , as sep

['this', 'is', 'a', 'test', 'string']
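The choice of separator matters. The sketch below uses an example sentence with a double space in it to compare splitting on whitespace with splitting on a single space, and shows the string method *join*, which reverses a split.

```python
sentence = 'this is  a test'   # note the double space before 'a'

words_default = sentence.split()      # split on any run of whitespace
words_space = sentence.split(' ')     # split on every single space
joined = '-'.join(words_default)      # glue the words back together

print(words_default)   # ['this', 'is', 'a', 'test']
print(words_space)     # ['this', 'is', '', 'a', 'test']
print(joined)          # this-is-a-test
```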


**1.d. Bools**

Finally, we have bools. Bools are either True or False. If you make a comparison, the result will be a True or a False (or multiple values, depending on what you are comparing)

In [19]:
bool1 = True
print(type(bool1))

<class 'bool'>


Greater than and less than comparisons work in the following way:

In [20]:
#compare numbers
print(3<4) #less than
print(3>4) #more than
print(3<=3) #less or equal
print(3>=4) #more or equal

True
False
True
False


Equal or not equal works like this. Mind the double ==. This is necessary to distinguish between variable assignments and equality tests

In [21]:
#compare numbers
print(3==4) #equal
print(3!=4) #not equal

False
True


In [22]:
#comparisons auto-convert ints and floats
print(int(3)==float(3))

True


If we want to check if a substring appears in a string, we can use *in*. *In* can also be used to see if an element is present in a list

In [23]:
#in operator
print('hello' in str2)
print('hello' in str3) #str3 is capitalised
print('hello' in str3.lower()) #convert str3 back to lowercase

True
False
True
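For lists, *in* checks whether a complete element is present, not whether a substring occurs somewhere. A small sketch with a made-up list:

```python
fruits = ['apples', 'bananas', 'coconuts']

whole_element = 'bananas' in fruits       # True: exact element found
partial_word = 'banana' in fruits         # False: no element equals 'banana'
inside_string = 'banana' in fruits[1]     # True: substring of 'bananas'
missing = 'pears' not in fruits           # True: 'not in' is the opposite test

print(whole_element, partial_word, inside_string, missing)
```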


**EXERCISES**

Now that we have seen how everything works, you should be able to solve these exercises

**Exer 1**

Below, a course name is given. Convert this name so the first letter is capitalised. Do this in a general way that works for every name.

Hint: is there a string method that can help you here?

In [24]:
CourseName = 'process engineering'

In [25]:
CourseName = CourseName.capitalize()
CourseName

'Process engineering'

The course name contains two words. Capitalize both of them. 

Hint: strings can be split based on spaces

In [26]:
temp = CourseName.split()
CourseName = temp[0].capitalize() + ' ' + temp[1].capitalize()
CourseName

'Process Engineering'
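The solution above assumes the name has exactly two words. As a sketch of a more general approach (not necessarily the intended solution), the built-in string method *title* capitalises every word, and *split* combined with *join* does the same by hand. The course name below is made up for illustration, and the loop uses *for*, which is explained in section 2.

```python
course = 'introduction to data science'   # example name, chosen for illustration

# option 1: the built-in title method
titled = course.title()

# option 2: split into words, capitalise each one, and glue them back together
capitalised = []
for word in course.split():
    capitalised.append(word.capitalize())
rejoined = ' '.join(capitalised)

print(titled)
print(rejoined)
```

Note that *title* also capitalises letters that follow an apostrophe, so option 2 can be the safer choice for names containing one.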

**Exer 2**

Below, a list of some important model organisms used in research is given. Using this as the basis, create a new list that contains the shortened name instead (initial of the genus + species name). So instead of 'Escherichia coli', we want 'E. coli'.

In [27]:
#define list
model_organism = ['Escherichia coli', 'Arabidopsis thaliana', 'Drosophila melanogaster'] 

In [28]:
#solution
model_organism_short = list() #start with an empty list
temp = model_organism[0].split(' ') #split on space
model_organism_short.append(temp[0][0] + '. ' + temp[1]) #add together
#repeat for the others
temp = model_organism[1].split(' ') #split on space
model_organism_short.append(temp[0][0] + '. ' + temp[1]) #add together
temp = model_organism[2].split(' ') #split on space
model_organism_short.append(temp[0][0] + '. ' + temp[1]) #add together

print(model_organism_short)

['E. coli', 'A. thaliana', 'D. melanogaster']
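The three repeated blocks do the same thing for each element. Once the *for* loop from section 2 is familiar, the repetition can be avoided. The following is a sketch of that idea, which works for a list of any length as long as every name consists of a genus and a species separated by one space.

```python
model_organism = ['Escherichia coli', 'Arabidopsis thaliana', 'Drosophila melanogaster']

model_organism_short = []                  # start with an empty list
for name in model_organism:
    parts = name.split(' ')                # ['Escherichia', 'coli']
    short = parts[0][0] + '. ' + parts[1]  # first letter of genus + species
    model_organism_short.append(short)

print(model_organism_short)
```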


## 2. Control structures

In this second part, we will discuss control structures. We will discuss IF, FOR and WHILE.

**ILLUSTRATION**

**2.a. IF**

With the if statement, a block of code will be executed if a specific condition is met (condition is True). The following form is used:

    if condition:
        code block
    
Mind the indentation here. The code block must be indented (shifted to the right) so Python knows which part of the code is inside the *if* statement. 

In [29]:
#if
num = 45

#if
if num > 25:
    print("Number is greater than 25") #indented
print("This part is not inside the if statement") #not indented

Number is greater than 25
This part is not inside the if statement


With *if else*, we can supply two blocks of code. If the condition is True, the *if* block is executed, if it is False, the *else* block is executed.

In [30]:
#if-else
if num % 2 == 0: #remainder after division is zero
    print("Number is even")
else:
    print("Number is odd")

Number is odd


Using *elif*, we can check multiple conditions in a row. If the first condition is not met, the next *elif* is checked. If they are all False, *else* is executed.

In [31]:
# if-elif-else
# any number of elif can be used
if num < 0:
    print("Number is a negative number")
elif num > 0:
    print("Number is a positive number")
else:
    print("Number is zero")

Number is a positive number


Multiple *if* statements can be nested together

In [32]:
#nested ifs
num = 40

#if
if num > 25:
    if num % 2 == 0: 
        print("Number is greater than 25 and even")

Number is greater than 25 and even


We can accomplish the same thing using the *and* statement to combine both conditions. *Or* can also be used.

In [33]:
#and
num = 40

if num > 25 and num % 2 == 0: 
    print("Number is greater than 25 and even")

Number is greater than 25 and even
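To complement the *and* example, here is a short sketch of *or*. With *and*, both conditions must be True; with *or*, at least one of them must be True. The thresholds are arbitrary example values.

```python
num = 40

# 'and': both conditions must be True
both = num > 25 and num % 2 == 0     # True

# 'or': at least one condition must be True
either = num < 0 or num % 2 == 0     # True, because 40 is even
neither = num < 0 or num > 100       # False, both conditions fail

if num < 0 or num > 100:
    message = 'Number is outside the range 0 to 100'
else:
    message = 'Number is between 0 and 100'
print(message)
```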


**2.b. FOR**

Using *for*, we can execute a block of code a fixed number of times. *For* uses a loop variable that takes the next value from a sequence on every pass through the loop. There are different kinds of sequences to loop over. Below, you can see loops over a range of numbers and over a list.

In [34]:
#for, range
#use only endpoint
for i in range(5):
    print(i)

0
1
2
3
4


In [35]:
#for
#start and endpoint
for i in range(3, 5):
    print(i)

3
4


In [36]:
#for
#start and endpoint + step
for i in range(2, 10, 2): #from 2 up to (not including) 10 in steps of 2
    print(i)

2
4
6
8


In [37]:
#for
#using a list
list_fruit = ['apples', 'bananas', 'coconuts']
for fruit in list_fruit:
    print(fruit)

apples
bananas
coconuts
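Both kinds of loop can be applied to the same list. As a sketch: looping over the elements directly is convenient for reading values, while looping over the positions with range(len(...)) is needed when elements have to be replaced.

```python
list_fruit = ['apples', 'bananas', 'coconuts']

# loop over the elements: collect the length of each word
lengths = []
for fruit in list_fruit:
    lengths.append(len(fruit))
print(lengths)

# loop over the positions: replace each element by its uppercase version
for i in range(len(list_fruit)):
    list_fruit[i] = list_fruit[i].upper()
print(list_fruit)
```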


**2.c. WHILE**

With *while*, a piece of code is executed repeatedly for as long as a condition is True. The loop stops as soon as the condition becomes False.

In [38]:
#while
num = 0
while num<10:
    num+=1 #add 1 to number
print(num)

10
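A *while* loop is useful when the number of repetitions is not known in advance. The sketch below, with an arbitrary starting value, halves a number until it is no longer greater than 1 and counts the passes. The condition is checked before every pass, so the code inside the loop must eventually make it False, otherwise the loop never ends.

```python
value = 100      # arbitrary starting value
halvings = 0     # counts how often the loop body ran

while value > 1:         # checked before every pass
    value = value / 2    # moves value towards the stopping condition
    halvings += 1

print(value)
print(halvings)
```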


**EXERCISES**

**Exer 1**

The Collatz conjecture is a famous unsolved math problem. Given a number (positive integer), the following operation is performed:
* If the number is even: divide by 2
* If it is odd: multiply by 3 and add 1

If the new number is not equal to 1, the same rules are followed again. Given these rules, will the sequence always converge to 1?

Write a program that executes these operations for a given number. Using *input*, you can prompt the user for a number (note that this will be given as string by default). Count the number of operations performed and write this back to the user.

EXTRA: adapt your code to give an error message when an invalid input is given (not a positive integer). For this, you can use the string method isnumeric(), which returns True only if every character in the string is numeric; a minus sign or a decimal point makes it return False. Keep in mind that 0 passes this check but is an invalid input as well.
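To get a feel for this check, the sketch below applies isnumeric() to a few example strings (chosen for illustration). Note that '0' passes, which is why a separate test on the value is still needed.

```python
candidates = ['15', '0', '-3', '2.5', 'abc', '']

checks = []
for text in candidates:
    checks.append(text.isnumeric())   # True only if all characters are numeric
    print(repr(text), text.isnumeric())
```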

In [39]:
current_num = input('Give a number: ')
current_num = int(current_num)
num_op = 0 #number of operations
while current_num!=1:
    if current_num % 2 == 0: #remainder after division is zero => even
        current_num=current_num//2 #integer division keeps the number an int
    else:
        current_num=current_num*3 + 1
    num_op+=1
print('Converged')
print(str(num_op) + ' operations were performed.')

Give a number: 5
Converged
5 operations were performed.


In [40]:
#with input checking

current_num = input('Give a number: ')
if current_num.isnumeric():
    current_num = int(current_num)
    if current_num<1:
        print('Number must be positive')
    else: #run the code
        num_op = 0 #number of operations
        while current_num!=1:
            if current_num % 2 == 0: #remainder after division is zero => even
                current_num=current_num//2 #integer division keeps the number an int
            else:
                current_num=current_num*3 + 1
            num_op+=1
        print('Converged')
        print(str(num_op) + ' operations were performed.')
else:
    print('Invalid input')


Give a number: 15
Converged
17 operations were performed.
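For reuse and testing, the same logic can be wrapped in a function. This is only a sketch: the name `collatz_steps` is made up for illustration, and defining functions with *def* goes beyond what this notebook covers. It uses integer division (`//`) so the number stays an int instead of turning into a float.

```python
def collatz_steps(start):
    """Count the operations needed to reach 1 from a positive integer."""
    if start < 1:
        raise ValueError('start must be a positive integer')
    current = start
    steps = 0
    while current != 1:
        if current % 2 == 0:          # even: divide by 2
            current = current // 2
        else:                         # odd: multiply by 3 and add 1
            current = 3 * current + 1
        steps += 1
    return steps

print(collatz_steps(5))
print(collatz_steps(15))
```

The two calls reproduce the counts obtained in the cells above (5 and 17 operations).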


**Exer 2**

Replace the drug names in the list below by their capitalised version (you can use the *capitalize* string method).

In [41]:
#define list
DrugNames = ['adderall', 'mydayis', 'ritalin', 'concerta']

In [42]:
for i in range(len(DrugNames)):
    DrugNames[i] = DrugNames[i].capitalize()
print(DrugNames)

['Adderall', 'Mydayis', 'Ritalin', 'Concerta']


## 3. File input/output

Finally, we will briefly discuss loading files into a notebook. There are various methods to do this, but some of them work better in Colab than in a local program. In general, there are two options: loading from a network location (such as Github) and loading local files.

**ILLUSTRATION**

**3.a  Network location (such as Github)**

Option 1 (colab recommended): clone the entire repository from Github

After the clone command, the repo will show up in the sidebar (select the files icon). It can now be accessed as if it's a local file

In [43]:
#this method should not be used outside google colab
if in_colab:
    #first clone the repo (insert the link below)
    !git clone https://github.com/jverwaer/IntroDataScience
    #navigate to the repo and then use / to select files inside the repo
    #open the file; 'with' closes it again automatically
    with open('IntroDataScience/PCLabs/files_IDS/iris.csv', 'r') as f:
        #read the lines
        print(f.readlines())

Option 2 (works everywhere): load raw files using the direct link

- Click on the file in the repository on Github, 
- Click on 'View Raw',
- Copy the URL of the raw file, 
- Use this URL as the location of your file. 

Some functions can work directly with a URL. However, the function *open* cannot do this. Therefore, we need to use the requests library.


In [44]:
#load a file directly from Github
url='https://raw.githubusercontent.com/jverwaer/IntroDataScience/main/PCLabs/files_IDS/iris.csv' #raw github link

#load requests library
import requests

#use requests to load the file from the url
f = requests.get(url)
#print out the text
print(f.text)

sepal length;sepal width;petal length;petal width;soort
5.1;3.5;1.4;0.2;setosa
4.9;3.0;1.4;0.2;setosa
4.7;3.2;1.3;0.2;setosa
4.6;3.1;1.5;0.2;setosa
5.0;3.6;1.4;0.2;setosa
5.4;3.9;1.7;0.4;setosa
4.6;3.4;1.4;0.3;setosa
5.0;3.4;1.5;0.2;setosa
4.4;2.9;1.4;0.2;setosa
4.9;3.1;1.5;0.1;setosa
5.4;3.7;1.5;0.2;setosa
4.8;3.4;1.6;0.2;setosa
4.8;3.0;1.4;0.1;setosa
4.3;3.0;1.1;0.1;setosa
5.8;4.0;1.2;0.2;setosa
5.7;4.4;1.5;0.4;setosa
5.4;3.9;1.3;0.4;setosa
5.1;3.5;1.4;0.3;setosa
5.7;3.8;1.7;0.3;setosa
5.1;3.8;1.5;0.3;setosa
5.4;3.4;1.7;0.2;setosa
5.1;3.7;1.5;0.4;setosa
4.6;3.6;1.0;0.2;setosa
5.1;3.3;1.7;0.5;setosa
4.8;3.4;1.9;0.2;setosa
5.0;3.0;1.6;0.2;setosa
5.0;3.4;1.6;0.4;setosa
5.2;3.5;1.5;0.2;setosa
5.2;3.4;1.4;0.2;setosa
4.7;3.2;1.6;0.2;setosa
4.8;3.1;1.6;0.2;setosa
5.4;3.4;1.5;0.4;setosa
5.2;4.1;1.5;0.1;setosa
5.5;4.2;1.4;0.2;setosa
4.9;3.1;1.5;0.2;setosa
5.0;3.2;1.2;0.2;setosa
5.5;3.5;1.3;0.2;setosa
4.9;3.6;1.4;0.1;setosa
4.4;3.0;1.3;0.2;setosa
5.1;3.4;1.5;0.2;setosa
5.0;3.5;1.3;0.3;setosa
4
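The downloaded text is one long string. As a sketch of how it could be turned into rows and columns with the string methods from section 1, the example below uses the first three lines of the output above as a literal sample, so it runs without a network connection. The variable names are made up for illustration.

```python
# first three lines of the file, copied from the output above
sample_text = ('sepal length;sepal width;petal length;petal width;soort\n'
               '5.1;3.5;1.4;0.2;setosa\n'
               '4.9;3.0;1.4;0.2;setosa\n')

lines = sample_text.splitlines()    # one list element per line
header = lines[0].split(';')        # column names

rows = []
for line in lines[1:]:              # skip the header
    rows.append(line.split(';'))    # split each data row on ';'

first_petal_length = float(rows[0][2])   # entries are strings until converted
print(header)
print(rows)
print(first_petal_length)
```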

**3.b Local files**

In the sidebar, navigate to the files icon and select *upload*. Then browse to the file on your computer.

Outside of colab, uploading is not necessary if the file is in the correct location (same folder as notebook). Otherwise, you need to copy the correct path.

Other paths can be set using the following commands:

    import os
    os.chdir('c:\\temp\\files_IDS') #navigate to the temp folder

Keep in mind that setting a path like this will make the notebook less reproducible (a location might not exist on a different computer).
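One way to keep a notebook more portable is to use paths relative to the notebook's folder instead of changing the working directory. The sketch below uses the standard library module pathlib to build such a path; the folder and file names follow the ones used in this notebook, and whether the file actually exists depends on your own setup.

```python
from pathlib import Path

# relative path: interpreted from the current working directory
data_file = Path('files_IDS') / 'iris.csv'

print(data_file)             # the relative path
print(Path.cwd())            # the folder the path is relative to
print(data_file.exists())    # True only if the file is really there
```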

If we are working with a local program, there is no need to upload anything: we can simply point to the correct location to load the file.

In [45]:
#google colab
if in_colab:
    path = 'iris.csv' #remember to upload the file first
else: #locally
    path = 'files_IDS/iris.csv' #location (relative path)
with open(path, 'r') as f: #'with' closes the file automatically
    print(f.readlines())

['sepal length;sepal width;petal length;petal width;soort\n', '5.1;3.5;1.4;0.2;setosa\n', '4.9;3.0;1.4;0.2;setosa\n', '4.7;3.2;1.3;0.2;setosa\n', '4.6;3.1;1.5;0.2;setosa\n', '5.0;3.6;1.4;0.2;setosa\n', '5.4;3.9;1.7;0.4;setosa\n', '4.6;3.4;1.4;0.3;setosa\n', '5.0;3.4;1.5;0.2;setosa\n', '4.4;2.9;1.4;0.2;setosa\n', '4.9;3.1;1.5;0.1;setosa\n', '5.4;3.7;1.5;0.2;setosa\n', '4.8;3.4;1.6;0.2;setosa\n', '4.8;3.0;1.4;0.1;setosa\n', '4.3;3.0;1.1;0.1;setosa\n', '5.8;4.0;1.2;0.2;setosa\n', '5.7;4.4;1.5;0.4;setosa\n', '5.4;3.9;1.3;0.4;setosa\n', '5.1;3.5;1.4;0.3;setosa\n', '5.7;3.8;1.7;0.3;setosa\n', '5.1;3.8;1.5;0.3;setosa\n', '5.4;3.4;1.7;0.2;setosa\n', '5.1;3.7;1.5;0.4;setosa\n', '4.6;3.6;1.0;0.2;setosa\n', '5.1;3.3;1.7;0.5;setosa\n', '4.8;3.4;1.9;0.2;setosa\n', '5.0;3.0;1.6;0.2;setosa\n', '5.0;3.4;1.6;0.4;setosa\n', '5.2;3.5;1.5;0.2;setosa\n', '5.2;3.4;1.4;0.2;setosa\n', '4.7;3.2;1.6;0.2;setosa\n', '4.8;3.1;1.6;0.2;setosa\n', '5.4;3.4;1.5;0.4;setosa\n', '5.2;4.1;1.5;0.1;setosa\n', '5.5;4.2;1.4;0
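As the output shows, readlines() keeps the newline character '\n' at the end of every line. The sketch below writes a tiny stand-in file to a temporary folder (so it runs anywhere, without iris.csv), reads it back inside a *with* block, and strips the newlines. The file name and contents are made up for illustration.

```python
import os
import tempfile

# create a tiny stand-in file so the example runs anywhere
folder = tempfile.mkdtemp()
path = os.path.join(folder, 'example.csv')
with open(path, 'w') as f:
    f.write('sepal length;sepal width\n5.1;3.5\n4.9;3.0\n')

# 'with' closes the file automatically when the block ends
with open(path, 'r') as f:
    lines = f.readlines()
file_is_closed = f.closed

# remove the trailing newline from every line
clean_lines = []
for line in lines:
    clean_lines.append(line.strip())

print(lines)
print(clean_lines)
print(file_is_closed)
```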

**EXERCISES**

**Exer 1**

Get the raw Github links for iris_features and iris_labels and read both files separately.

In [46]:
url_X='https://raw.githubusercontent.com/jverwaer/IntroDataScience/main/PCLabs/files_IDS/iris_features.csv' #raw github link
url_Y='https://raw.githubusercontent.com/jverwaer/IntroDataScience/main/PCLabs/files_IDS/iris_labels.csv' #raw github link

f = requests.get(url_X)
X = f.text
f = requests.get(url_Y)
Y = f.text

**Exer 2**

Using the data you loaded, extract the petal length of the third flower.

In [57]:
Xdata = X.split() #split into data rows (row 0 is presumably the header, so index 3 is the third flower)
flower = Xdata[3].split(',') #split data row into entries
petallength = flower[2] #get the third entry (PetalLength)
print(petallength)

1.3
