# Python - Data Structures and Functions
Some tasks are based on exercises from https://www.practicepython.org/.<br>
For better readability and encapsulation of our code, we can define functions that perform specific instructions. We can call functions with inputs, and functions may return an output but do not have to. A function is defined with the keyword *def* followed by the name of the function, its parameters in parentheses, and a colon.

In [1]:
# Function without input and without output
def hello_world():
    print('Hello World!')
    
# Function with input, but without output
def print_my_input(my_input):
    print(my_input)

# Function with input and output
def add(value1, value2):
    return value1 + value2

A defined function is called by typing its name followed by parentheses that contain the arguments:

In [2]:
hello_world()
print_my_input('Hello World!')
print(add(1, 3))

Hello World!
Hello World!
4
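
A function without a `return` statement still hands back a value, namely `None`. The short sketch below repeats two of the functions from above to show the difference; the variable names `result_without_return` and `result_with_return` are only illustrative.

```python
def hello_world():
    print('Hello World!')

def add(value1, value2):
    return value1 + value2

# hello_world() prints its message and then implicitly returns None
result_without_return = hello_world()
# add() explicitly returns the sum of its arguments
result_with_return = add(1, 3)

print(result_without_return)  # None
print(result_with_return)     # 4
```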


In Python you can create lists of items (similar to arrays in other languages). We already know lists as the output of `string.split()` and from our previous exercise, but let's take a closer look at them.

In [3]:
students = ["John", "Mary", "Ana", "Tim", "Bill"]
grades = [1, 1.7, 3.0, 1, "Not in class"]
print("These are my students:", students)
print("Here are their grades:", grades)

These are my students: ['John', 'Mary', 'Ana', 'Tim', 'Bill']
Here are their grades: [1, 1.7, 3.0, 1, 'Not in class']


Lists can be indexed analogously to strings:

In [4]:
print("This is the first student:", students[0])
print("And this is their grade:", grades[0])

This is the first student: John
And this is their grade: 1
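
Since lists are indexed like strings, negative indices and slices work as well. A small sketch using the `students` list from above (the variable names on the left are only illustrative):

```python
students = ["John", "Mary", "Ana", "Tim", "Bill"]

last_student = students[-1]    # negative indices count from the end
first_two = students[:2]       # slice from the start up to (not including) index 2
every_second = students[::2]   # slice with a step of 2

print(last_student)   # Bill
print(first_two)      # ['John', 'Mary']
print(every_second)   # ['John', 'Ana', 'Bill']
```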


Lists in Python are like magic: they may contain all kinds of objects together, e.g. floats, integers, strings, and even lists inside lists, as we had in our previous exercise submission.

__Task 1__: Consider the following list of numbers.<br>
1. Iterate over the list and print each number less than 15.
2. Instead of printing each number while iterating, collect the numbers in another list and print that list as a whole.
3. Try to use only one line for the creation of the list in subtask 2. (Hint: List Comprehension)

In [5]:
list_of_numbers = [3, 7, 10, 23, 8, 15, 34, 12, 16, 5, 45, 63, 13, 9]

# Code
lst = [num for num in list_of_numbers if num < 15]
print(lst)

[3, 7, 10, 8, 12, 5, 13, 9]
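
The list comprehension covers subtask 3. For comparison, subtasks 1 and 2 could be solved with explicit loops, roughly as sketched here (the name `small_numbers` is only illustrative):

```python
list_of_numbers = [3, 7, 10, 23, 8, 15, 34, 12, 16, 5, 45, 63, 13, 9]

# Subtask 1: print each number below 15 while iterating
for num in list_of_numbers:
    if num < 15:
        print(num)

# Subtask 2: collect the numbers in a new list and print it as a whole
small_numbers = []
for num in list_of_numbers:
    if num < 15:
        small_numbers.append(num)
print(small_numbers)
```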


__Task 2__: Write a function that expects two lists (of possibly unequal lengths) as input and returns the intersection of both lists without duplicates (Hint: *set()*). The result of the function executed on the lists [1, 2, 3] and [1, 3, 5, 6, 3] should be [1, 3].

In [6]:
def lst_intersection(l1, l2):
    # set() removes duplicates; sorted() turns the resulting set back into a list
    return sorted(set(l1).intersection(l2))

print(lst_intersection([1, 2, 3], [1, 3, 5, 6, 3]))

[1, 3]


You can also create dictionaries (Map/HashMap in Java).
A dictionary links elements in pairs that we call keys and values.
### Important!
Keys must be unique. Assigning to an existing key overwrites the value stored under that key.

In [7]:
dictionary = {}
dictionary["a"] = 1
dictionary["b"] = 2
dictionary["c"] = 3
print(dictionary)

dictionary["a"] = [1, 2, 3, 5]
print(dictionary)

{'a': 1, 'b': 2, 'c': 3}
{'a': [1, 2, 3, 5], 'b': 2, 'c': 3}
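
Beyond assignment, a few common dictionary operations are iterating over the key/value pairs, testing whether a key exists, and reading a key with a fallback value. A minimal sketch (the key `"z"` and the variable names `has_b` and `missing` are only illustrative):

```python
dictionary = {"a": 1, "b": 2, "c": 3}

# Iterate over all key/value pairs
for key, value in dictionary.items():
    print(key, "->", value)

has_b = "b" in dictionary          # the membership test checks the keys
missing = dictionary.get("z", 0)   # get() returns the default instead of raising KeyError

print(has_b)    # True
print(missing)  # 0
```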


Again, it is possible to mix different data types in a dictionary, including other dictionaries.

In [8]:
my_dictionary = {}
my_dictionary["d"] = dictionary
my_dictionary["e"] = 5
my_dictionary["f"] = 6
print(my_dictionary)

{'d': {'a': [1, 2, 3, 5], 'b': 2, 'c': 3}, 'e': 5, 'f': 6}


In [9]:
print(my_dictionary["d"]["a"])

[1, 2, 3, 5]


In [10]:
my_number = my_dictionary["e"]
print(my_number)

5


__Task 3__: Consider the long word from practice class 1. Use a dictionary to store how often each character occurs in this word, i.e. output_dict['a'] should contain the number of times the character 'a' occurs in the word. Encapsulate your code in a function, so that you can use it again on other words.

In [11]:
long_word = 'pneumonoultramicroscopicsilicovolcanoconiosis'
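def word_len_dict(word):
    # Map each character to the number of times it occurs in the word
    output_dict = {}
    for char in word:
        output_dict[char] = output_dict.get(char, 0) + 1
    return output_dict

print(word_len_dict(long_word))

{'p': 2, 'n': 4, 'e': 1, 'u': 2, 'm': 2, 'o': 9, 'l': 3, 't': 1, 'r': 2, 'a': 2, 'i': 6, 'c': 6, 's': 4, 'v': 1}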
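
As an alternative to a hand-written dictionary, the standard library offers `collections.Counter`, which counts the characters of a string in one call. This is only a sketch for comparison; the task itself asks for a plain dictionary.

```python
from collections import Counter

long_word = 'pneumonoultramicroscopicsilicovolcanoconiosis'

# Counter behaves like a dictionary that maps each character to its count
char_counts = Counter(long_word)

print(char_counts['a'])  # 2
print(char_counts['o'])  # 9
print(char_counts['z'])  # 0, missing keys count as zero
```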


## Now it is time for some NumPy insights
NumPy is a Python library for numerical computing with good support for vectors and matrices. A library is loaded with the *import* statement. With *as* we can define an abbreviation for the library, so that we do not need to type its full name.

In [12]:
import numpy as np

a = np.arange(15)
print(a)

[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14]


In [13]:
a = a.reshape(5,3)
print(a)

[[ 0  1  2]
 [ 3  4  5]
 [ 6  7  8]
 [ 9 10 11]
 [12 13 14]]


__Task 4__: Define NumPy arrays containing 10 zeros, 10 ones and 10 fives. (Note: There is more than one solution.)

In [14]:
b = np.zeros(10)
print(b)

c = np.ones(10)
print(c)

d = np.array([5, 5, 5, 5, 5, 5, 5, 5, 5, 5])
print(d)

[0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
[5 5 5 5 5 5 5 5 5 5]
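
As the task notes, there is more than one solution. A few other ways to build the array of fives are sketched below; the variable names are only illustrative.

```python
import numpy as np

fives_full = np.full(10, 5)       # fill a new array of length 10 with the constant 5
fives_scaled = np.ones(10) * 5    # scale an array of ones (gives floats)
fives_repeat = np.repeat(5, 10)   # repeat the value 5 ten times

print(fives_full)    # [5 5 5 5 5 5 5 5 5 5]
print(fives_scaled)  # [5. 5. 5. 5. 5. 5. 5. 5. 5. 5.]
print(fives_repeat)  # [5 5 5 5 5 5 5 5 5 5]
```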


### A Numpy array has different attributes like shape, size, dimensions, type

__Question__: What is actually the difference between them? Discuss this with your neighbor.

In [15]:
print("shape of a is:", a.shape)
print("size of a is:", a.size)
print("dimensions in a:", a.ndim)
print("type of a is:", a.dtype)

shape of a is: (5, 3)
size of a is: 15
dimensions in a: 2
type of a is: int64
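
The difference between the attributes may become clearer with a three-dimensional example. The array `cube` below is a hypothetical example, not part of the exercise.

```python
import numpy as np

cube = np.arange(24).reshape(2, 3, 4)

print(cube.shape)  # (2, 3, 4): the length of each axis
print(cube.ndim)   # 3: the number of axes, i.e. len(cube.shape)
print(cube.size)   # 24: the total number of elements, 2 * 3 * 4
print(np.zeros(3).dtype)  # float64: the type of the stored elements
```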


We can also apply the ordinary arithmetic operators to vectors, i.e. NumPy arrays.

In [16]:
a = np.array([10, 20, 30, 40])
b = np.arange(4)

In [17]:
print(a)
print(b)

[10 20 30 40]
[0 1 2 3]


In [18]:
print(a + b)

[10 21 32 43]


In [19]:
print(a - b)

[10 19 28 37]


__Question__: What is the difference between the following instructions? Discuss this with your neighbor.

In [20]:
print(a * 2)
print(a * b)
print(a.dot(b))

[20 40 60 80]
[  0  20  60 120]
200
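
One way to see how the three instructions relate: the dot product equals the sum of the element-wise products. The sketch below checks this for the arrays from above; the variable names are only illustrative.

```python
import numpy as np

a = np.array([10, 20, 30, 40])
b = np.arange(4)

scaled = a * 2           # every element is multiplied by the scalar 2
elementwise = a * b      # elements at the same position are multiplied
dot_product = a.dot(b)   # a single number: 10*0 + 20*1 + 30*2 + 40*3

print(elementwise.sum() == dot_product)  # True
```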


In [21]:
c = np.zeros(4)
print(c)

[0. 0. 0. 0.]


__Task 5__: Define one function that expects a NumPy matrix and returns the sum of all elements in this matrix, the sum of each column, and the sum of each row (each as a vector). The result for<br>
[[1, 2, 3],<br>
[4, 5, 6],<br>
[7, 8, 9]]<br>
should be 45, [12, 15, 18], [6, 15, 24]. (Hint: *np.sum()*)

In [22]:
def matrix_fun(arr1):
    # Total sum, column sums (axis=0) and row sums (axis=1)
    return np.sum(arr1), np.sum(arr1, axis=0).tolist(), np.sum(arr1, axis=1).tolist()

print(matrix_fun(np.array([[1, 2, 3],
                           [4, 5, 6],
                           [7, 8, 9]])))

(45, [12, 15, 18], [6, 15, 24])


### Imagine we have two different arrays and we want to compare them

__Question__: How could we compare two arrays? The following two arrays may help with your thoughts. Discuss this with your neighbor.

In [23]:
first_array = np.zeros(10)
first_array[2] = 1
first_array[4] = 1
first_array[8] = 1

second_array = np.ones(10)

print(first_array)
print(second_array)

[0. 0. 1. 0. 1. 0. 0. 0. 1. 0.]
[1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
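
A few possible ways to compare the two arrays are sketched below: element by element, as a count of matching positions, or as a single yes/no answer. This is one possible approach, not the only one; the variable names are only illustrative.

```python
import numpy as np

first_array = np.zeros(10)
first_array[[2, 4, 8]] = 1
second_array = np.ones(10)

equal_mask = first_array == second_array   # boolean array, one entry per position
num_equal = int(equal_mask.sum())          # True counts as 1, so this counts the matches
identical = np.array_equal(first_array, second_array)  # True only if all elements match

print(equal_mask)
print(num_equal)   # 3
print(identical)   # False
```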


After creating an array, we can still assign values to specific positions in the array:

In [24]:
first_array = first_array.reshape(2,5)
print(first_array)

first_array[1][4] = 5
print(first_array)

[[0. 0. 1. 0. 1.]
 [0. 0. 0. 1. 0.]]
[[0. 0. 1. 0. 1.]
 [0. 0. 0. 1. 5.]]


We can also assign whole columns:

In [25]:
first_array[:,3] = 3
print(first_array)


[[0. 0. 1. 3. 1.]
 [0. 0. 0. 3. 5.]]


__Task 6__: We want every element in the first row of 'first_array' to be 2.

In [26]:
first_array[0] = 2
print(first_array)

[[2. 2. 2. 2. 2.]
 [0. 0. 0. 3. 5.]]


__Task 7__: Compute the following matrix multiplications: A \* B and B \* A. What does the ordinary multiplication operator `*` compute, and what do we have to change to perform a true matrix multiplication? (Hint: *array.reshape()*)

In [27]:
A = first_array
B = second_array
B = B.reshape(first_array.shape)
print(B*A)
print(A*B)

# Code

[[2. 2. 2. 2. 2.]
 [0. 0. 0. 3. 5.]]
[[2. 2. 2. 2. 2.]
 [0. 0. 0. 3. 5.]]
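
The `*` operator multiplies element by element. A true matrix product needs matching inner dimensions and the `@` operator (or `np.matmul`). The sketch below assumes `A` holds the values of `first_array` printed above and reshapes the array of ones to shape (5, 2), following the hint.

```python
import numpy as np

# Values of first_array after Task 6, copied from the output above
A = np.array([[2., 2., 2., 2., 2.],
              [0., 0., 0., 3., 5.]])
# Inner dimensions must match: (2, 5) @ (5, 2)
B = np.ones(10).reshape(5, 2)

AB = A @ B   # shape (2, 2)
BA = B @ A   # shape (5, 5)

print(AB)
print(BA.shape)  # (5, 5)
```

The two products have different shapes, so matrix multiplication, unlike element-wise multiplication, is not commutative.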


# Text Processing

Now let's look at how to extract information from text. Your task in this section is to understand what is going on, because you will need it for the homework exercise.

First we import some packages that we need for our task, and here we go!

In [28]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

Normally, models do not process text directly but a numerical version of it. Therefore we need to vectorize our data. There are several ways of doing this, as you might remember from our previous presentation. Here are a couple of them:

**Bag of words**: 
Let's imagine that we have four documents:

 1. I like to eat pizza every day
 2. I like to eat pasta every day
 3. I do not like pizza
 4. I do my homework every day
 
__Task 8__: We want to list all words that our documents consist of (i.e. the vocabulary).
You can use the string method `split` and the built-in function `sorted`.

In [29]:
sents = [
    "I like to eat pizza every day",
    "I like to eat pasta every day",
    "I do not like pizza",
    "I do my homework every day"
]

bag_of_words = []
for sent in sents:
    bag_of_words += sent.split()

print(sorted(set(bag_of_words)))


['I', 'day', 'do', 'eat', 'every', 'homework', 'like', 'my', 'not', 'pasta', 'pizza', 'to']


Having the vocabulary, we can now manually write the bag-of-words vector for each document. Each position counts how often the corresponding word of the sorted vocabulary occurs in the document:

1. [1, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 1]
2. [1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1]
3. [1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0]
4. [1, 1, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0]

And tada!! These are the vectors that we would feed into our model.
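
Instead of writing the vectors by hand, they can be computed from the sorted vocabulary. The following is a minimal pure-Python sketch; the names `vocabulary` and `vectors` are only illustrative.

```python
sents = [
    "I like to eat pizza every day",
    "I like to eat pasta every day",
    "I do not like pizza",
    "I do my homework every day"
]

# Sorted vocabulary: every distinct word across all documents
vocabulary = sorted({word for sent in sents for word in sent.split()})

# One vector per document: how often each vocabulary word occurs in it
vectors = [[sent.split().count(word) for word in vocabulary] for sent in sents]

for vector in vectors:
    print(vector)
```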

In [30]:
corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?']

In [31]:
count_vectorizer = CountVectorizer()
X = count_vectorizer.fit_transform(corpus)
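# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
print(count_vectorizer.get_feature_names_out().tolist())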

['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']


In [32]:
print(X.toarray())  

[[0 1 1 1 0 0 1 0 1]
 [0 2 0 1 0 1 1 0 1]
 [1 0 0 1 1 0 1 1 1]
 [0 1 1 1 0 0 1 0 1]]


## Reading and Writing Files

There are several ways of processing files; the most common one is the built-in `open()` function.

In [33]:
my_file = open("yelp_polarity.txt","r")

line_counter = 0

for line in my_file:
    print(line)
    line_counter += 1
    if line_counter >= 5:
        break

my_file.close()


Wow... Loved this place.	1

Crust is not good.	0

Not tasty and the texture was just nasty.	0

Stopped by during the late May bank holiday off Rick Steve recommendation and loved it.	1

The selection on the menu was great and so were the prices.	1



**Note**: When handling very large text files, loading the whole file into memory is inefficient, so it is important to learn how to read one line at a time. The `with` statement used below also closes the file automatically.

In [34]:
with open("yelp_polarity.txt") as my_file:
    line_counter = 0

    for line in my_file:
        print(line)
        line_counter += 1
        if line_counter >= 5:
            break
    

Wow... Loved this place.	1

Crust is not good.	0

Not tasty and the texture was just nasty.	0

Stopped by during the late May bank holiday off Rick Steve recommendation and loved it.	1

The selection on the menu was great and so were the prices.	1
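
The counter-and-break pattern can also be written with `itertools.islice`, which stops after a fixed number of lines without loading the rest of the file. The sketch below writes a small temporary stand-in file (its path and contents are assumptions made for the example) so that it runs without `yelp_polarity.txt`.

```python
import itertools
import os
import tempfile

# Hypothetical stand-in for yelp_polarity.txt, so the example is self-contained
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("Wow... Loved this place.\t1\n")
    tmp.write("Crust is not good.\t0\n")
    tmp.write("Not tasty and the texture was just nasty.\t0\n")
    path = tmp.name

with open(path) as my_file:
    # islice reads only the first 2 lines; rstrip removes the trailing newline,
    # which otherwise causes the blank lines seen in the output above
    first_lines = [line.rstrip("\n") for line in itertools.islice(my_file, 2)]

os.remove(path)  # clean up the temporary file
print(first_lines)
```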



If you have CSV-format files (or, as here, tab-separated files), you can read them with Pandas, which is a convenient way to analyse your data.

In [35]:
dataframe = pd.read_csv("yelp_polarity.txt", sep="\t", header=None)

In [36]:
dataframe

     0                                                   1
0    Wow... Loved this place.                            1
1    Crust is not good.                                  0
2    Not tasty and the texture was just nasty.           0
3    Stopped by during the late May bank holiday of...   1
4    The selection on the menu was great and so wer...   1
...  ...                                                 ...
995  I think food should have flavor and texture an...   0
996  Appetite instantly gone.                            0
997  Overall I was not impressed and would not go b...   0
998  The whole experience was underwhelming, and I ...   0


__Task 9__: Now it's your turn to explore data using Pandas. How many classes do we have in this file and how many samples per class? Remember that appending a question mark (?) to a function name in Jupyter shows its documentation, which is very useful when you do not know a function. Take a look at the functions `groupby` and `describe`.

In [50]:
# Code
print(dataframe.groupby(dataframe[1]).count())

     0
1     
0  500
1  500
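
Another way to count the samples per class is `value_counts`, shown here on a small made-up DataFrame (the rows in `toy` are hypothetical and only mimic the layout of the file: text in column 0, label in column 1).

```python
import pandas as pd

# Hypothetical miniature version of the data
toy = pd.DataFrame({0: ["good", "bad", "great", "awful"], 1: [1, 0, 1, 0]})

class_counts = toy[1].value_counts()   # number of samples per class label
num_classes = toy[1].nunique()         # number of distinct class labels

print(class_counts)
print(num_classes)  # 2
```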
