# SMU Master of Science (Economics) Programming Workshop in Python


## Introduction
This is an introductory course/workshop in Programming with Python, aimed at achieving the following objectives:
1. Give a brief introduction/overview of Python
2. Equip students with foundational programming knowledge for the MSc Programme in Economics

Throughout the duration of the course, we will learn programming using an application-based approach. For most of the course, students will be required to write code (in the form of in-class assignments) - this is probably the best way to learn coding/programming. 

More often than not, we are not interested in whether you got the answer right or wrong - we are more interested in the way you think about the question, and your approach towards solving it.

#### Course Outline
The course will be split into the following modules:

1. Introduction, Syntax, Operations and Data Structures
2. Loops, Conditional Statements, Functions and Classes
3. NumPy and Pandas - Data Cleaning, Reading and Manipulation
4. Matplotlib and Seaborn - Data Plotting
5. Algorithms

In today's class, we will be covering the first module on syntax, operations and data structures.

In [None]:
print
len
int
float
str
x
y

## Basic Syntax and Operations

In the early days of programming, humans had to write code that was close to what the machine actually executes. Fortran and C are 2 examples of "lower-level" languages (relative to Python): they are translated efficiently into machine instructions, but are not necessarily easy for humans to read:

<img src="images/fortran_code.jpg">

More recently, for the sake of simplicity and convenience, languages such as R and Python have become popular. These are "higher-level" languages: they come with a built-in interpreter that translates the code we write into instructions for the machine to execute.

The main benefit of languages such as Python and R is their readability and simplicity, but this comes at the cost of slower execution. This is usually not an issue unless we are dealing with *Big Data*.

In this chapter, we will start off with a quick introduction to the syntax (or grammar) and basic operations of the Python language - this will give you a taste of what Python can and cannot do.

<img src="images/python_code.jpeg">

### Variable Assignment

In the following example, we assign the value 200 * 17 to the variable "x", and then we use the print function to print its value. Note that we can also assign strings, which are also known as text, but we will have to use quotation marks.

In [5]:
# Problem 1 (Note that to make comments in Python, we can include a #)
x = 200 * 17
print(x) 

y = "Hello World."
print(y)

3400
Hello World.


### Type 

Note that in the above code block, there are 2 lines of output. The function, `print`, takes an argument, and prints it out. Below, we can use the `print` function along with the `type` function, to learn more about the type of the variables.

In [2]:
print(type(x))
print(type(y))

<class 'int'>
<class 'str'>


In [6]:
z = 'ABC'
type(z)

str

In [8]:
a = 10.0/3
type(a)

float

Not surprisingly, the value 3400 has the type `int`, while "Hello World." has the type `str`; the result of the division `10.0/3` is a `float`. There are many different data types in Python, but the main ones are as follows:

1. int - integers
2. float - floating point (decimals)
3. str - string variables

### Concatenating Strings and Integers

After seeing those examples, you may have the following questions in mind:
1. Can we add a string to an integer?
2. Can we add a string to a string?
3. Can we do multiplication with a string?

As it turns out, one of the best ways to understand more about programming is to **actively** code. Another way to learn more about programming is to learn to ask the right questions (most programmers use StackOverflow). 

Below, we answer the 3 questions in sequential order.

In [11]:
# Question 1
123 + 'ABC'

TypeError: unsupported operand type(s) for +: 'int' and 'str'

In [12]:
'123'+'ABC'

'123ABC'

As it turns out, we cannot add a string to an integer. This is also our first example of an error; in this case, a TypeError. To learn more about the error, we can look at the text that comes after its name: essentially, it says that the `+` operator does not work between an integer and a string. Note that `'123' + 'ABC'` works, because `'123'` in quotation marks is a string rather than an integer. There are ways to combine numbers and text, which we return to in the section on type conversion. For now, we move on to the second question: can we add a string to a string?

In [17]:
y = 'Hello World.'
y+y

'Hello World.Hello World.'

It turns out that we can add a string to a string. However, note that since we did not assign the result of the addition to a variable, we cannot retrieve it later. This implies the following: whenever you're interested in a result, or think that you may need it again, save it! This reduces the complexity of your program, since you do not have to run the same computation over and over again.

---
##### Naming Conventions and Comments
It is pretty important to know how to name your variables. 

Suppose you're naming a variable which contains the mean of the GDP variable, one sensible name will be to name it **gdp_mean**, instead of **x**. While this distinction seems pretty trivial right now, when you begin to write longer chunks of code, naming conventions make it easy for other readers to know what you're doing, and for you to follow what you're actually doing. In addition, having variables with uninformative names makes it difficult to carry on writing code if you need to call these variables in the future.

Writing comments also helps your readers understand what each block of code (be it a loop, a function or a conditional statement) is doing. As you begin to write more and more code, and begin to call previously defined functions, it becomes very difficult for your readers (and you) to follow what's going on. Writing comments can often help to reduce the reading burden of your readers.
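
As a small illustration of the two points above, the sketch below computes the same mean twice: once with uninformative names, and once with descriptive names and a comment. The growth figures are made-up placeholder values, not real data, and all variable names are only suggestions.

```python
# Hypothetical GDP growth figures (in %), made up purely for illustration
gdp_growth = [3.1, 2.4, 4.0, 1.5]

# Hard to follow: what do x and y stand for?
x = sum(gdp_growth)
y = x / len(gdp_growth)

# Easier to follow: the names say what the values represent
gdp_growth_total = sum(gdp_growth)
gdp_growth_mean = gdp_growth_total / len(gdp_growth)  # arithmetic mean

print(gdp_growth_mean)
```

Both versions compute the same number; only the second one tells the reader what that number means.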

---

And back to the main course. Adding 2 strings together essentially implies appending them to one another. Because of this, we can also do multiplication with a string.

In [22]:
# the following code block repeats y 10 times
y * 10

'Hello World.Hello World.Hello World.Hello World.Hello World.Hello World.Hello World.Hello World.Hello World.Hello World.'

In [5]:
3 * y

'Hello World.Hello World.Hello World.'

### Mathematical Operations in Python

In what follows, I will show basic mathematical operations for integers in Python, including:

1. Addition
2. Subtraction
3. Multiplication
4. Division
5. Exponentiation

In [26]:
3 + 6

9

In [27]:
4 - 7

-3

In [28]:
5 * 3

15

In [31]:
x = 2.5
type(x)
int(x)

2

In [32]:
j = 5 / 3
print(type(j), j)

<class 'float'> 1.6666666666666667


The result of a division with `/` is a floating-point number, even when both operands are integers. Floating-point numbers work the same way as integers; we can carry out multiplication, addition, division etc. on them. However, when we convert a floating-point number to an integer with `int`, Python drops the fractional part (it truncates toward zero, which for positive numbers is the same as rounding down). We will learn more about type conversion in the next section.

In [33]:
int(j)

1
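
To be precise about how `int` treats floats: it drops the fractional part, which coincides with rounding down only for positive numbers. A quick sketch comparing it with `math.floor` and the built-in `round` (the variable names are illustrative):

```python
import math

positive = 5 / 3     # 1.666...
negative = -5 / 3    # -1.666...

print(int(positive), int(negative))                # drops the fractional part
print(math.floor(positive), math.floor(negative))  # always rounds down
print(round(positive), round(negative))            # rounds to the nearest integer
```

If you need a specific rounding behaviour, it is safer to call `math.floor` or `round` explicitly than to rely on `int`.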

In [11]:
2 ** 3

8

### Type Conversion

In practice, one can convert an integer to a string. One way to do so is to use the `str` function. The `str` function takes a value (integer, string, float) and converts it to a string. In contrast, you can use the `int` function on a string only if the string represents a whole number, such as `'3400'`.

Below, we show some examples of what can and cannot be done.

In [34]:
x = 200 * 17
print(x, type(x))

3400 <class 'int'>


In [42]:
k = str(x)
print(type(k))

<class 'str'>


In [38]:
z = int(k)
print(type(z))

<class 'int'>


In [40]:
z

3400

In [39]:
y = "Hello World."
int(y)

ValueError: invalid literal for int() with base 10: 'Hello World.'

Here, we have our second type of error: a ValueError. The text after the error name says that Python was unable to convert the text, in this case "Hello World.", to an integer. No surprises here.
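
The sketch below collects a few conversions that do work, and one way of handling a conversion that does not. It uses `try`/`except`, which has not been covered yet, so treat that part as a preview rather than something you are expected to know.

```python
# Combining a number with text: convert the number to a string first
label = str(123) + 'ABC'
print(label)

# int() accepts strings that represent whole numbers
# (a sign and surrounding spaces are fine)
print(int('-42'), int(' 7 '))

# A string with a decimal point has to go through float() first
print(int(float('3.7')))

# Catching the ValueError instead of letting the program stop
try:
    int('3.7')
except ValueError as err:
    print('Could not convert:', err)
```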

## Data Structures

So far, we have seen quite a number of data types. How do we store these data types in a structure that is easily accessible, yet manipulable? 

It turns out that there are many different types of data structures that provide us with a schema for doing so. For example, we have ordered lists, tuples and more. Listed below are typical data structures used:

1. List (ordered list)
2. Sets
3. Dictionaries (hash-tables; key-value pairs)
4. Tuples

In this subsection, we learn more about the first 3 types of data structures.

### Lists
In what follows, we give a somewhat brief summary of what lists are. Essentially, a list is a collection of variables that has the following properties: 

1. They are **ordered**
2. They are **mutable**
3. They allow for duplicate members (taken from [here](https://www.w3schools.com/python/python_lists.asp)). 

Finally, lists can be constructed in Python using either the square brackets, [ ] or the function, `list()`.

We can also change the elements in a list - add new members to a list, remove members from a list and switch elements within the list.

In [46]:
lst = [25, 36, 42, 17, 19, 19, 51]
print(lst)
len(lst)

[25, 36, 42, 17, 19, 19, 51]


7

In [47]:
x = []
print(type(x))
print(len(x))

<class 'list'>
0


One interesting aspect of Python is that it uses zero-indexing i.e. the first element in the list is denoted by the index 0 instead of 1.

In the next code block, we demonstrate how to extract an element of the list. We can use the square brackets, [0], to query the 0-th element of the list.

In [51]:
lst[2]

42

In [53]:
lst[-1]

51

In [60]:
# Extract first 5 elements
lst[:5]

# Extract last 5 elements
lst[-5:]

[42, 17, 19, 19, 51]
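
Only the value of the last expression in a cell is displayed, so the output above shows just the last 5 elements. The sketch below stores each slice in its own variable (the names are illustrative) and prints them all, together with 2 further slicing patterns.

```python
lst = [25, 36, 42, 17, 19, 19, 51]

first_five = lst[:5]     # indices 0 to 4 (the stop index, 5, is excluded)
last_five = lst[-5:]     # from the 5th-last element to the end
middle = lst[2:4]        # indices 2 and 3
every_other = lst[::2]   # every second element, starting from index 0

print(first_five)
print(last_five)
print(middle)
print(every_other)
```

Slicing always returns a new list, so `lst` itself is left unchanged.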

In [61]:
x = ["Alex", "Bob", "Charlie"] # assign the variable x to a list containing 3 names
print("The first element in the list is " + x[0] + ".")

The first element in the list is Alex.


In [62]:
x[1]

'Bob'

In [63]:
x[-1]

'Charlie'

Similar to strings, when we concatenate 2 lists with `+`, Python places the elements of the second list, `y`, after the elements of the first list, `x`.

That is, it creates a new list `[x[0], x[1], ..., y[0], y[1], ...]`, leaving `x` and `y` themselves unchanged.

Thus, the order of the operation matters: x + y yields a different output compared to y + x. Similar to strings as well, we can also multiply a list by an integer: `x * 2` concatenates 2 copies of `x`.

In [66]:
lst2 = [15, 23, 40]
lst + lst2

lst3 = [15, 214, "Alex", "Bob", "C"]
lst3

[15, 214, 'Alex', 'Bob', 'C']

In [67]:
y = ["Diane", "Elaine"] # assigns the variable to a list containing 2 names
print(x+y)
print(len(x+y))

['Alex', 'Bob', 'Charlie', 'Diane', 'Elaine']
5


In [56]:
x * 2

['Alex', 'Bob', 'Charlie', 'Alex', 'Bob', 'Charlie']

Lists are a simple way of storing and extracting information. One can easily verify whether an element is in a list, using the `in` operator.

Python checks whether the element is in the list, and returns True if it is and False otherwise. However, note that this check is **case-sensitive**, as evident from the following examples.

Note: Methods and functions are really similar, although there are some fundamental differences between them. For example, a method belongs to a particular type of object and is called using dot notation (e.g. `"ALEX".lower()`), whereas a function such as `len` is called on its own. This will be discussed in a future lesson.

In [68]:
print(x)

['Alex', 'Bob', 'Charlie']


In [71]:
"alex" in x

False

In [70]:
"Bobby" in x

False

In [73]:
"ALEX".lower()

'alex'

In [74]:
[i.lower() for i in x]

['alex', 'bob', 'charlie']

In [75]:
"ALEX".lower() in [i.lower() for i in x]

True
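
The cells above mix 3 kinds of syntax: functions, methods and the `in` operator. As a rough sketch of the difference (the list `names` is an illustrative stand-in for `x`):

```python
names = ["Alex", "Bob", "Charlie"]

# Functions are called with the object as an argument
print(len(names))            # len works on lists, strings, sets, dictionaries, ...

# Methods are called on the object with dot notation
print(names.count("Bob"))    # a list method
print("Alex".lower())        # a string method

# `in` is an operator, used without brackets or dots
print("Bob" in names)
```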

In [78]:
lst3 = lst + lst2
lst3

[25, 36, 42, 17, 19, 19, 51, 15, 23, 40]

Lists can perform a variety of operations, which are referred to as methods (you can check them out [here](https://www.programiz.com/python-programming/methods/list)). We will go through some of them in this lesson:

1. append
2. extend
3. insert
4. remove
5. reverse
6. sort

Note that these methods modify the list in place. After applying one of them, the list itself is altered and the original version is not kept, so make a copy first if you still need it.
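
A minimal sketch of what this means in practice, using the `sort` method and the built-in `sorted` function as an example (the list `scores` and the other names are made up for illustration):

```python
scores = [3, 1, 2]

# sorted() is a function that returns a new list; the original is untouched
ordered = sorted(scores)
print(scores, ordered)

# Make a copy first if the original order is still needed later
backup = scores.copy()

# sort() is a method that changes the list itself and returns None
result = scores.sort()
print(scores, result, backup)
```

A common mistake is to write `scores = scores.sort()`, which replaces the list with `None`.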

In [100]:
# The list method, append, adds an element to the end of the list (note that an element of a list can be a list itself)
x = ['Alex', 'Bob', 'Charlie']
print(x)
x.append(["Daniel", "Edgar"])
print(x)
len(x)

['Alex', 'Bob', 'Charlie']
['Alex', 'Bob', 'Charlie', ['Daniel', 'Edgar']]


4

In [101]:
x

['Alex', 'Bob', 'Charlie', ['Daniel', 'Edgar']]

In [102]:
x[-1]

['Daniel', 'Edgar']

In [103]:
x[-1][1]

'Edgar'

In [104]:
y = ['Diane', 'Elaine']

In [105]:
# The list method, extend, extends the list to include the elements of another list
print(x, y)
x.extend(y)
print(x)

['Alex', 'Bob', 'Charlie', ['Daniel', 'Edgar']] ['Diane', 'Elaine']
['Alex', 'Bob', 'Charlie', ['Daniel', 'Edgar'], 'Diane', 'Elaine']


In [106]:
x

['Alex', 'Bob', 'Charlie', ['Daniel', 'Edgar'], 'Diane', 'Elaine']

In [107]:
# The list method, insert, inserts an element to the list 
# (requires 2 arguments - 1st argument is the position, and 2nd argument is the element)
x.insert(2, "Kai")
print(x)

['Alex', 'Bob', 'Kai', 'Charlie', ['Daniel', 'Edgar'], 'Diane', 'Elaine']


In [96]:
alp = ['A', 'B', 'D']
alp.insert(2, ['C+', 'C', 'C-'])
alp

['A', 'B', ['C+', 'C', 'C-'], 'D']

In [108]:
# The list method, remove, removes an element from the list, if it is present
x = ['A', 'B', 'C', 'D']
x.remove('B')
print(x)

['A', 'C', 'D']


In [109]:
x.remove('B')

ValueError: list.remove(x): x not in list

In [110]:
x.insert(1, 'B')
x

['A', 'B', 'C', 'D']

In [111]:
# The list method, reverse, reverses the order of the elements in the list
x.reverse()
print(x)

['D', 'C', 'B', 'A']


In [112]:
# The list method, sort, sorts the list by numerical or alphabetical order
x.sort()

In [113]:
x

['A', 'B', 'C', 'D']

In [117]:
z = ['Apple', '100', '50', '9', 'Sea']
z.sort()

In [118]:
z

['100', '50', '9', 'Apple', 'Sea']

In [120]:
z = ['A', ['B', 'C']]
z.reverse()

In [121]:
z

[['B', 'C'], 'A']

In [93]:
z.remove(['B', 'C'])

In [95]:
z + ['B', 'C']

['A', 'B', 'C']

As it turns out, you cannot sort a list that contains both lists and strings. 
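
The cells above do not actually attempt such a sort, so here is a small sketch of what happens. It uses `try`/`except` (not covered yet) only so that the error is printed instead of stopping the program; `mixed` and `sort_failed` are illustrative names.

```python
mixed = ['A', ['B', 'C']]
sort_failed = False

try:
    mixed.sort()
except TypeError as err:
    # Python cannot decide whether a string comes before or after a list
    sort_failed = True
    print('TypeError:', err)

print(sort_failed)
```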

In-class assignment:

Can you sort a list if it has both integers and strings? Use an example to show this.

In [29]:
# Your code here

In [30]:
x.remove(['Daniel', 'Edgar'])
x.sort()
print(x)

['Alex', 'Bob', 'Charlie', 'Elaine', 'James']


Now that we have some experience with lists, here's a quick in-class assignment:

1. Create 2 lists: one that contains the numbers, 3, 5, 7 and another that contains strings "Helen", "Jake", "Betty".
2. Add the integer 4 to the first list so that it is the second element, and the string "Sarah" to the second one such that it is the first element in the second list.
3. Sort both lists in reverse order (i.e. the first list should return [7, 5, 4, 3])
4. Append the second list to the first list.
5. Check if you can sort them.

Note: there are many ways to do this problem.

In [129]:
# Your code here
x = [3, 5, 7]
y = ['Helen', 'Jake', 'Betty']

x1 = [x[0], 4] + x[1:]
print(x1)
y1 = ['Sarah'] + y
print(y1)

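# reverse() flips the current order; it gives a descending order only if the list was
# sorted in ascending order beforehand (sort(reverse=True) sorts in descending order)
x1.reverse()
y1.reverse()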

z = x1+y1
print(z)

[3, 4, 5, 7]
['Sarah', 'Helen', 'Jake', 'Betty']
[7, 5, 4, 3, 'Betty', 'Jake', 'Helen', 'Sarah']


In [104]:
z.sort()

TypeError: '<' not supported between instances of 'str' and 'int'
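
2 details of this exercise are worth a closer look. First, `reverse` only flips the existing order, whereas sorting in descending order needs `reverse=True`. Second, a mixed list can still be sorted if we tell Python how to compare the elements, for example by treating each one as a string. The sketch below is one possible approach, not the only one.

```python
names = ['Sarah', 'Helen', 'Jake', 'Betty']

flipped = names.copy()
flipped.reverse()                          # only flips the current order
descending = sorted(names, reverse=True)   # sorts from Z to A

print(flipped)
print(descending)

# A mixed list can be sorted if every element is compared as a string
mixed = [7, 5, 4, 3, 'Betty', 'Jake']
print(sorted(mixed, key=str))
```

With `key=str`, the numbers are compared as text, so this ordering is alphabetical rather than numerical.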

### Sets

A set is a collection of variables and has the following properties: 
1. They are **unordered** 
2. They are **unindexed**
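3. They do not allow for duplicate members. 

That is, sets contain only **unique** values. To create a set, we can place elements inside curly brackets, e.g. `{1, 2, 3}`, or use the function, `set()`. Note that empty curly brackets, `{}`, create an empty dictionary rather than an empty set.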

In [130]:
x = ["A", "B", "C", "A"]
set_x = set(x)
set_x[0]

TypeError: 'set' object does not support indexing

In [133]:
set_a = set()
print(type(set_a))
set_a = {"Apples", "Bananas", "Oranges", "Pears", "Apples"}
set_a

<class 'set'>


{'Apples', 'Bananas', 'Oranges', 'Pears'}

In the previous code block, we wrote a set with 5 strings, but the string "Apples" appears twice. 

Noting that sets contain only unique values, we should expect `set_a` to have 4 elements.

In [134]:
set_a

{'Apples', 'Bananas', 'Oranges', 'Pears'}
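
A short sketch of the 2 ways of creating a set, including one common pitfall with empty curly brackets (the variable names are illustrative):

```python
letters = ["A", "B", "C", "A"]
unique_letters = set(letters)      # duplicates are dropped
print(len(letters), len(unique_letters))

empty_set = set()                  # an empty set
empty_braces = {}                  # careful: this is an empty dictionary
print(type(empty_set), type(empty_braces))
```

Converting a list to a set is a quick way to count how many distinct values it contains.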

Similar to lists, one can check whether an element is in the set using the `in` operator, but note that it is case-sensitive.

In [137]:
print("Bananas" in set_a)
print("bananas" in set_a)

# The following code is robust to errors resulting from different cases
"BANANAS".lower() in [fruit.lower() for fruit in list(set_a)]

True
False


True

In [138]:
print("bananas" not in set_a)

True


To add an element to the set, we can use the set method, "add". To add more than one element, we can use the set method, "update", to update a set with the elements of **another set** (or of any other collection, such as a list). To check how many elements the set contains, we can use the function, `len()`.

In [139]:
set_a.add("Watermelons")
set_a

{'Apples', 'Bananas', 'Oranges', 'Pears', 'Watermelons'}

In [140]:
set_a.add("Pears")
set_a

{'Apples', 'Bananas', 'Oranges', 'Pears', 'Watermelons'}

In [143]:
# Note that sets have no index, so the elements may not appear in the order they were added
set_b = {"Mangoes", "Durians"}
set_a.update(set_b)
set_a

{'Apples', 'Bananas', 'Durians', 'Mangoes', 'Oranges', 'Pears', 'Watermelons'}

In [144]:
len(set_a)

7

What if we were to add an element that is already present in the set?

In [145]:
set_a.add("Durians")
set_a

{'Apples', 'Bananas', 'Durians', 'Mangoes', 'Oranges', 'Pears', 'Watermelons'}

In [146]:
len(set_a)

7

As it turns out, adding the string, "Durians" does not change the set in any way, since the string **already** appears in the set. This is one of the key differences between sets and lists. Another difference (as we have previously discussed) is that sets do not keep track of the order in which elements were added, but lists do.

---

To remove an element from the set, we can use the set method, "remove". However, if an element does not exist in the set, this will raise an error. 

For that reason, some prefer to use the set method, "discard". Below, we show both methods.

In [147]:
set_a

{'Apples', 'Bananas', 'Durians', 'Mangoes', 'Oranges', 'Pears', 'Watermelons'}

In [148]:
set_a.remove("Jackfruits")

KeyError: 'Jackfruits'

Since the element "Jackfruits" does not appear in the set, this raises a KeyError. What does the "KeyError" here refer to? The elements of a set play the same role as the keys of a dictionary (which we cover next), so Python reports a missing element as a missing key: the element we tried to remove is not in the set.

In [149]:
set_a.remove("Mangoes")

In [152]:
set_a

{'Apples', 'Bananas', 'Durians', 'Oranges', 'Pears', 'Watermelons'}

In [154]:
set_a.discard("Durians")
set_a

{'Apples', 'Bananas', 'Oranges', 'Pears', 'Watermelons'}
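
The cells above only call `discard` on an element that is present. The sketch below shows the case that makes `discard` useful, namely an element that is missing (the set `fruits` is a small stand-in for `set_a`):

```python
fruits = {"Apples", "Bananas", "Oranges"}

fruits.discard("Jackfruits")       # not in the set: nothing happens, no error
print(len(fruits))

# remove() on a missing element raises a KeyError, so check first if unsure
if "Jackfruits" in fruits:
    fruits.remove("Jackfruits")

fruits.discard("Bananas")          # present: removed as usual
print("Bananas" in fruits)
```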

To learn more about sets, you can refer to the link [here](https://www.w3schools.com/python/python_sets.asp).

### Dictionaries

A dictionary is a collection of key-value pairs that has the following properties:
1. They are **ordered by insertion** (since Python 3.7; in earlier versions they were unordered)
2. They are **changeable**
3. They are **indexed** by key rather than by position

In Python, dictionaries are also written with curly brackets _{ }_, but they contain 2 types of elements: keys and values. 

Dictionaries are perhaps the most useful of all data structures, since we are able to store 2 types of information with one data structure. 

For example, a teacher may be keen to store the test results of students. Then, the keys are the names of the students, while the values are their test results.

In [168]:
# We begin by defining a dictionary: "height", that contains 6 key-value pairs 
# where keys are names, and values are height

height = {
    "John": 175,
    "Marie": 165,
    "Jack": 190,
    "Stacy": 177,
    "Jackson": 168,
    'Dana': 180
}

height

{'Dana': 180,
 'Jack': 190,
 'Jackson': 168,
 'John': 175,
 'Marie': 165,
 'Stacy': 177}

In [169]:
# To access the height (value) of a certain individual, we can use the key (name of the individual)
height['Dana']

180

In [170]:
# To check if a name is in the dictionary, we can use the "in" operator
'Stacy' in height.keys()

True

In [171]:
175 in height.values()

True

In [172]:
# To check the names of the dictionary, we can use the "keys" method
list(height.keys())

['John', 'Marie', 'Jack', 'Stacy', 'Jackson', 'Dana']

In [173]:
# Similarly, we can do the same for values using the "values" method
height.values()

dict_values([175, 165, 190, 177, 168, 180])

The `keys` and `values` methods return the view objects `dict_keys` and `dict_values`, respectively. One can coerce these objects to be lists, by using the function, `list()`, or to sets, by using `set()`.

In [174]:
set(height.keys())

{'Dana', 'Jack', 'Jackson', 'John', 'Marie', 'Stacy'}
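
Besides `keys` and `values`, dictionaries also have an `items` method that returns the key-value pairs together. The sketch below uses a smaller dictionary, `height_sample`, so that it does not overwrite `height`; the name is purely illustrative.

```python
height_sample = {"John": 175, "Marie": 165, "Jack": 190}

names = list(height_sample.keys())      # list of keys
values = list(height_sample.values())   # list of values
pairs = list(height_sample.items())     # list of (key, value) tuples

print(names)
print(values)
print(pairs)

# `in` on the dictionary itself checks the keys, not the values
print("Marie" in height_sample, 165 in height_sample)
```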

In [175]:
# We can also change the values of a specific key. For example, suppose Dana grew taller by 5cm this year.
height['Dana']

180

In [176]:
height['Dana'] = 185
# height['Dana'] += 5
height['Dana']

185

In [177]:
# In addition, we can add new keys to the dictionary
height["Luke"] = 180
height

{'Dana': 185,
 'Jack': 190,
 'Jackson': 168,
 'John': 175,
 'Luke': 180,
 'Marie': 165,
 'Stacy': 177}

In [178]:
len(height)

7

In [179]:
# To remove a key-value pair from the dictionary, we can use the "pop" method
height.pop("Jackson")
height

{'Dana': 185,
 'Jack': 190,
 'John': 175,
 'Luke': 180,
 'Marie': 165,
 'Stacy': 177}

In [180]:
height.pop("Lucy")

KeyError: 'Lucy'

In [133]:
# Alternatively, one can use the del (which stands for delete) statement
del height['Jack']
height

{'Dana': 185, 'John': 175, 'Luke': 180, 'Marie': 165, 'Stacy': 177}

In [134]:
del height["Lucy"]

KeyError: 'Lucy'
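
If a key may or may not be present, there are a few ways to avoid the KeyError. The sketch below again uses a small stand-in dictionary, `height_sample`, rather than `height`.

```python
height_sample = {"Dana": 185, "John": 175}

# pop() with a second argument returns that default instead of raising a KeyError
removed = height_sample.pop("Lucy", None)
print(removed)

# get() looks a key up without raising an error when it is missing
print(height_sample.get("Lucy"), height_sample.get("Dana"))

# del needs an explicit check
if "Lucy" in height_sample:
    del height_sample["Lucy"]
```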

In-class assignment:

1. Create a dictionary, called "weight", using the following dataset:

|Name  | Weight| Gender |
|------|-------|--------|
|Jane  | 46.0  | Female |
|John  | 75.2  |  Male  |
|Tina  | 50.2  | Female |
|Lena  | 48.5  | Female |
|Kane  | 78.2  |  Male  |
|Ryan  | 69.7  |  Male  |

2. Suppose data for a new person (Kate) is available, and she weighs 43.0 kg. Add this information in.
3. What is the average weight for the males in the group? (Hint: the `mean` function from the `statistics` module will be useful: `mean( )`)
4. What about the average weight for the females?

Suppose information on the height of the individuals is given:

|Name  | Height | Gender |
|------|------- |--------|
|Jane  | 155.0  | Female |
|John  | 181.2  |  Male  |
|Tina  | 172.6  | Female |
|Lena  | 162.3  | Female |
|Kane  | 174.8  |  Male  |
|Ryan  | 172.3  |  Male  |
|Kate  | 151.8  | Female |

1. Create another dictionary, "height" using the above dataset. Note that with the inclusion of this dataset, we have information on the height and weight of the 7 individuals.
2. Create a new list, "data" which contains 2 dictionaries (height and weight). For each dictionary, keys are the names and values are the corresponding heights and weights.

In [182]:
# Please do your in-class assignment here
weight = { 
    'Jane': 46.0, 
    'John': 75.2,
    'Tina': 50.2,
    'Lena': 48.5,
    'Kane': 78.2,
    'Ryan': 69.7
}

weight['Kate'] = 43.0

# Average weight of males
avgmaleweight = (weight['John'] + weight['Kane'] + weight['Ryan'])/3
# Average weight of females (Jane, Tina and Lena only; Kate is not included here)
avgfemaleweight = (weight['Jane'] + weight['Tina'] + weight['Lena'])/3

print('Average weight of males is', avgmaleweight)
print('Average weight of females is', avgfemaleweight)

height = {
    'Jane': 155.0,
    'John': 181.2,
    'Tina': 172.6,
    'Lena': 162.3,
    'Kane': 174.8,
    'Ryan': 172.3,
    'Kate': 151.8
}

data = [weight, height]
data

Average weight of males is 74.36666666666667
Average weight of females is 48.23333333333333


[{'Jane': 46.0,
  'John': 75.2,
  'Kane': 78.2,
  'Kate': 43.0,
  'Lena': 48.5,
  'Ryan': 69.7,
  'Tina': 50.2},
 {'Jane': 155.0,
  'John': 181.2,
  'Kane': 174.8,
  'Kate': 151.8,
  'Lena': 162.3,
  'Ryan': 172.3,
  'Tina': 172.6}]
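
As the note in the assignment says, there are many ways to do this problem. One alternative, sketched below, follows the hint and uses `mean` from the `statistics` module. The `gender` dictionary is a hypothetical helper built from the Gender column of the tables; it is not part of the solution above. Because Kate is listed as female, this version includes her in the female average, so that figure differs from the one printed above.

```python
from statistics import mean

weight = {'Jane': 46.0, 'John': 75.2, 'Tina': 50.2, 'Lena': 48.5,
          'Kane': 78.2, 'Ryan': 69.7, 'Kate': 43.0}

# Hypothetical helper dictionary built from the Gender column of the tables
gender = {'Jane': 'Female', 'John': 'Male', 'Tina': 'Female', 'Lena': 'Female',
          'Kane': 'Male', 'Ryan': 'Male', 'Kate': 'Female'}

# Collect the weights of each group by looking up every name's gender
male_weights = [weight[name] for name in weight if gender[name] == 'Male']
female_weights = [weight[name] for name in weight if gender[name] == 'Female']

print('Average weight of males is', mean(male_weights))
print('Average weight of females is', mean(female_weights))
```

The advantage of this approach is that adding another person only requires 2 new dictionary entries; the averages update without any change to the formulas.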

### Application of Data Structures (Prelude to the next module)
In this section of the class, I will show an example of why lists and dictionaries are so important and powerful in computing. Consider the following problem, where we have a specific text, and we need to find the number of times each word appears in the text (suppose we have 10,000 texts with 10,000 words each). This may be because we expect words that appear more frequently to be more important or provide more information about the text.

For the sake of simplicity, we focus on only 1 text, but it is entirely plausible to have 10,000 texts to analyse. Our approach here is to break the problem down into smaller problems which are easier to solve, and then combine the solutions to the smaller problems and check if we can find a "global" solution.

Here, you will also see a glimpse of function definition and for-loops (both to be studied in the second module).

Let's proceed to read in the text, convert its letters to lower-case, and remove punctuation.

In [2]:
# Read and save text data to variable (text)
with open("text/intro.txt", "r") as f:  # the file is closed automatically afterwards
    text = f.read()
print(text)


Why should you learn to write programs?

Writing programs (or programming) is a very creative 
and rewarding activity.  You can write programs for 
many reasons, ranging from making your living to solving
a difficult data analysis problem to having fun to helping
someone else solve a problem.  This book assumes that 
everyone needs to know how to program, and that once 
you know how to program you will figure out what you want 
to do with your newfound skills.  

We are surrounded in our daily lives with computers ranging 
from laptops to cell phones.  We can think of these computers
as our personal assistants who can take care of many things
on our behalf.  The hardware in our current-day computers 
is essentially built to continuously ask us the question, 
What would you like me to do next?

Programmers add an operating system and a set of applications
to the hardware and we end up with a Personal Digital
Assistant that is quite helpful and capable of helping
us do many different th

In [29]:
# Data cleaning using a function
def data_cleaning(text):
    '''
    This function strips leading and trailing whitespace from the input, converts it
    into lower case and removes all punctuation from it.
    In addition, it returns the text as a list of words by splitting on whitespace.
    
    Input: str
    Output: list
    '''
    text = text.strip() # Remove whitespace (spaces, \n, \t) from both ends of the text
    text = text.lower() # Converts the text to lower-case

    # Remove punctuation from text and save to the newtext variable
    punctuations = r'''!()-[]{};:'"\,<>./?@#$%^&*_~''' # raw string, so the backslash is kept as it is
    newtext = ""
    for char in text: # iterate over each character and keep it only if it is not punctuation
        if char not in punctuations:
            newtext += char
    
    text_list = newtext.split() # splits on any run of whitespace (spaces, tabs, newlines)
    return text_list

text_list = data_cleaning(text)

In [30]:
# The output from the previous function returns a list of strings that are all in lower-case
text_list[:10]

['why',
 'should',
 'you',
 'learn',
 'to',
 'write',
 'programs',
 'writing',
 'programs',
 'or']

In [32]:
# Create a dictionary with words as key and values as the number of times the word appears in the list
word_count = {}

for word in text_list: # Loop through the words in the list, and keep count of each word
    if word in word_count: # If the word has appeared before, add 1
        word_count[word] += 1
    else: # Else initiate with value 1
        word_count[word] = 1
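
The same counting pattern can be written more compactly. The sketch below shows 2 alternatives on a short sample (the first 10 words of the cleaned text): the dictionary method `get`, and the `Counter` class from the standard library's `collections` module. Neither is used in the cell above; they are offered only as options to explore.

```python
from collections import Counter

sample_words = ['why', 'should', 'you', 'learn', 'to', 'write', 'programs',
                'writing', 'programs', 'or']

# Option 1: dict.get returns 0 for words that have not been seen yet
counts = {}
for word in sample_words:
    counts[word] = counts.get(word, 0) + 1

# Option 2: Counter does the counting in one step
counter = Counter(sample_words)

print(counts['programs'])
print(counter.most_common(1))   # the most frequent word and its count
```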

In [36]:
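# Sort the (word, count) pairs by count, from most to least frequent
sorted_word_count = sorted(word_count.items(), key=lambda pair: pair[1], reverse=True)

# Get top 50 words that appear in the text
sorted_word_count[:50]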

[('the', 249),
 ('to', 205),
 ('a', 172),
 ('and', 164),
 ('you', 152),
 ('is', 114),
 ('of', 103),
 ('python', 103),
 ('in', 81),
 ('it', 76),
 ('that', 73),
 ('we', 67),
 ('for', 52),
 ('are', 43),
 ('be', 42),
 ('program', 41),
 ('language', 41),
 ('your', 39),
 ('with', 37),
 ('will', 34),
 ('as', 31),
 ('at', 31),
 ('or', 29),
 ('have', 27),
 ('programs', 26),
 ('our', 25),
 ('when', 25),
 ('this', 24),
 ('but', 24),
 ('write', 23),
 ('on', 23),
 ('can', 22),
 ('what', 21),
 ('computer', 21),
 ('very', 20),
 ('how', 20),
 ('programming', 19),
 ('these', 19),
 ('use', 19),
 ('words', 18),
 ('not', 18),
 ('more', 18),
 ('from', 17),
 ('so', 17),
 ('machine', 17),
 ('need', 16),
 ('problem', 15),
 ('do', 15),
 ('if', 15),
 ('word', 15)]

In [20]:
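import wordcloud
import matplotlib.pyplot as plt

# Build the word cloud from the raw text, then draw it as an image
wc = wordcloud.WordCloud().generate(text)

plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.show()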

## Conclusion

In this lesson, we have discussed the following:

1. Different data types
2. Different data structures
3. How to use these data types and structures to store data we are interested in