# Environment reminder
* Load the Big Data virtual machine in VirtualBox
* login : hduser, password : hduser
* Launch Jupyter from the shell with `jupyter notebook`
* Type the Python code with your answers into the "code" cells and execute them with Ctrl-Enter
* If a cell shows a star instead of its number ("In [*]"), it is still being executed

# Introduction
The goals for this session are multiple:
* become familiar with the Jupyter computing environment, a widely used tool for developing and sharing data analysis and data visualization workflows. It supports several programming languages and is an important part of the pydata ecosystem. For more information: https://jupyter.org/
* understand how to run basic Python commands and check their output on screen.
* touch on the most important Python built-in containers (lists, dictionaries and sets) and the basic techniques for using them.
* learn basic text file handling in Python.
* combine all the techniques above to develop a simple text-processing application.

# Exercise 1 - Print Function

**Create** a function `print_my_name` that prints 'Hello "your name"', where "your name" is passed as an input parameter. The function should raise a warning if the parameter passed is empty.

For more documentation on formatting see Python 3 formatting at https://docs.python.org/3.1/library/string.html#format-examples.
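
As a small illustration of the two ingredients involved (string formatting and the `warnings` module), here is a sketch built around a hypothetical `greet` function. The function name and the warning text are assumptions made for the example, not part of the exercise.

```python
import warnings

def greet(name):
    """Return a greeting; emit a warning when the name is empty."""
    if not name:
        # warnings.warn reports a problem without stopping the program
        warnings.warn("empty name passed to greet")
    return 'Hello "{}"'.format(name)

print(greet("Alice"))
```

Unlike an exception, a warning lets the call finish, so the caller still receives a result.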

# Exercise 2 - For Loop and String Functions

**Transform** the string `names` into a list in which each element is one name, using the `split` function (if you don't remember how to use a function, type `name_of_the_function?` to display its help). Then, with a `for` loop, print "Hello + name" for each name in the list.


In [1]:
names = "Alice Bob Carol Eve Mallory Oscar"
#write your code here

# To Do

**Repeat** the exercise above, with the string `names2`. The output should be exactly the same as above.

In [14]:
names2 = "Alice;Bob.Carol;Eve.Mallory;Oscar"
#write your code here
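
When a string mixes several separators, two common approaches are to normalise the separators with `str.replace` before calling `split`, or to split on a character class with `re.split`. A minimal sketch on a made-up string (`colours` is hypothetical example data, not the exercise input):

```python
import re

colours = "red,green|blue,black"   # hypothetical data with two separators

# Option 1: turn every "|" into "," and split once
parts_replace = colours.replace("|", ",").split(",")

# Option 2: split on either character with a regular expression
parts_regex = re.split(r"[,|]", colours)

for colour in parts_regex:
    print("Colour: " + colour)
```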

# Exercise 3 - Join function and enumerate

**Convert**  "Alice Bob Carol Eve Mallory Oscar" to "Alice;Bob;Carol;Eve;Mallory;Oscar"

# To Do

**For** the same people, print each name with its sequence number: Alice - 1, etc. Use the function `enumerate` and take care of the starting value for indexing.
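
The two functions named in this exercise can be sketched on unrelated data as follows. The string `fruits` is a hypothetical example, and the output format is only one possible choice.

```python
fruits = "apple banana cherry"   # hypothetical example data

# split on whitespace, then glue the pieces back with another separator
joined = ", ".join(fruits.split())

# enumerate accepts a start value, so numbering need not begin at 0
numbered = ["{} - {}".format(fruit, i)
            for i, fruit in enumerate(fruits.split(), start=1)]

print(joined)
print(numbered)
```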

# Exercise 4 - Using dictionaries

**Create** `dict_names_num`, a `dict` with the names in the string `names` defined above as keys and a random number as each value. We will use the standard-library function `random.random` and the built-in function `round`.

In [3]:
import random
random.random()


0.6138024449281274

# To Do

**Try** to assign a list as a key to `dict_names_num`. Is the operation accepted? Can you describe the error?

**Compare** the order in which the items (an item is a key/value pair) are shown in the dictionary with their original order in the list. Do they match? Explain your answer.
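
Both questions can be explored on a small, unrelated dictionary. The sketch below uses hypothetical data (`ages`) and relies on two facts about Python: dictionary keys must be hashable, and since Python 3.7 a `dict` keeps its keys in insertion order.

```python
ages = {"zoe": 31, "adam": 25}     # hypothetical data
ages["mia"] = 40

try:
    ages[["a", "list"]] = 1        # lists are mutable, hence unhashable
except TypeError as err:
    error_message = str(err)
    print("TypeError:", error_message)

# The keys come back in the order in which they were inserted
key_order = list(ages)
print(key_order)
```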

# Exercise 5 - List Comprehension

**List** comprehension is a technique to apply the same operation to all elements in a container. In the following exercise we compute the square root of each number in the list `nums`, which holds the integers produced by `range(10)` (0, 1, ..., 9).

In [13]:
from math import sqrt
nums = list(range(10))

#add your code here

The equivalent code without using the list comprehension would have been:

In [None]:
sqrt_elems = []
for elem in nums:
    sqrt_elems.append( sqrt(elem) )

Whenever possible, a list comprehension should be preferred, as it gives more readable, elegant and compact code.
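
To show the general shape of a list comprehension next to its loop form, here is a sketch on a different operation (squaring) and hypothetical data, so that it does not replace the exercise itself.

```python
values = [1, 2, 3, 4]                  # hypothetical data

# [expression for item in iterable]
squares = [v * v for v in values]

# the same result built with an explicit loop
squares_loop = []
for v in values:
    squares_loop.append(v * v)

print(squares)
```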

# To Do

**By using** the built-in function `enumerate`, compute the square root only for the elements at even index positions (0, 2, ...).

Expected output:

[1.0488088481701516, 2.3021728866442674]


**By using** the built-in function `zip`, compute the square root of the sum of the element at index i and the element at index len(nums)-1-i.

That is if

nums = [x1,x2,x3,x4]

Then:

result =[ sqrt(x1+x4) , sqrt(x2+x3) , sqrt(x3+x2) , sqrt(x4+x1) ]

-----
Expected output:
[2.2803508501982757, 2.7202941017470885, 2.7202941017470885, 2.2803508501982757]
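
The two building blocks used above, filtering on the index from `enumerate` and pairing a list with its reverse through `zip`, can be sketched on hypothetical integer data. The square root is left out on purpose so the pattern itself stays visible.

```python
data = [4, 9, 16, 25]   # hypothetical data

# keep only the items whose index is even
even_pos = [x for i, x in enumerate(data) if i % 2 == 0]

# pair each item with its mirror image from the other end of the list
mirrored_sums = [a + b for a, b in zip(data, reversed(data))]

print(even_pos)
print(mirrored_sums)
```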



**Get** the names and random values from the previous dictionary as two separate lists

**Remove** duplicates from the following list `list_1` : a,b,c,a,b,c,d,a,v,d,s,a,b

**Find** letters which are in `x` but not in `y`: a,b,v,f,g,d,a,z,a,e,a,e,q,a,f
(https://docs.python.org/3/library/stdtypes.html#set)
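
A short sketch of the set operations that are useful here, on hypothetical data (`first` and `second` are made-up lists of letters). Note that a `set` does not preserve order. The `dict.fromkeys` idiom is shown as one way to remove duplicates while keeping the first-seen order.

```python
first = list("banana")     # hypothetical data
second = list("bread")

unique_letters = set(first)                  # duplicates collapse
only_in_first = set(first) - set(second)     # set difference

# removes duplicates while keeping first-seen order
unique_ordered = list(dict.fromkeys(first))

print(unique_letters, only_in_first, unique_ordered)
```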

**Calculate** a list of logarithms for the numeric list `1, 2, ..., 20`
(see https://docs.python.org/3/tutorial/datastructures.html#list-comprehensions and https://docs.python.org/3/library/math.html#math.log)

**Exclude** from the list 0, 1, 2, ..., 99 the items divisible by 3, using the conditional form of list comprehension
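
The conditional form places an `if` clause after the `for` clause. A minimal sketch on smaller, hypothetical ranges, which also shows that `math.log` with a single argument is the natural logarithm:

```python
from math import log

small = list(range(1, 6))                 # 1..5, hypothetical data
natural_logs = [log(n) for n in small]    # natural logarithm of each item

# the trailing `if` keeps only the items for which the test is true
odd_only = [n for n in range(10) if n % 2 != 0]

print(natural_logs)
print(odd_only)
```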

# Working with files, strings etc

**Download** the complete works of Shakespeare from the Project Gutenberg pages (http://www.gutenberg.org/files/100/100-0.txt)


**Review** the code below

In [22]:
#Example of opening a file and reading lines
with open("100-0.txt",mode = "r",encoding="utf-8") as f:
    for i in range(3):
        print(f.readline())

﻿

Project Gutenberg’s The Complete Works of William Shakespeare, by

William Shakespeare



**Exercise** Count the number of lines in the file by reading line by line

**Package** the code that counts the rows into a function `calc_nr_rows(file_nm, load_into_memory)`, where `file_nm` is the name of the file whose rows are counted and `load_into_memory` is a boolean indicating whether the file can be loaded into memory (True) or not (False). Then benchmark the function on the file 100-0.txt using the magic function `%timeit`. Discuss the results. Hint: you can display the help of `%timeit` with `%timeit?`
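
The difference between the two strategies can be sketched as follows. The function name `count_lines` and the temporary demo file are assumptions made so that the example runs on its own; this is one possible shape, not the reference solution.

```python
import os
import tempfile

def count_lines(path, load_into_memory):
    """Count lines either via readlines() or by streaming the file."""
    with open(path, mode="r", encoding="utf-8") as f:
        if load_into_memory:
            return len(f.readlines())    # whole file held in a list
        return sum(1 for _ in f)         # one line at a time

# Self-contained demo on a small temporary file
with tempfile.NamedTemporaryFile("w", encoding="utf-8", suffix=".txt",
                                 delete=False) as tmp:
    tmp.write("first\nsecond\nthird\n")
    demo_path = tmp.name

in_memory = count_lines(demo_path, True)
streamed = count_lines(demo_path, False)
os.remove(demo_path)
print(in_memory, streamed)
```

In a notebook cell, each variant could then be timed on a real file, for instance with `%timeit count_lines("100-0.txt", True)`.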
        

With the facilities we have seen so far, and others that you may want to explore on the official Python website, blogs, StackOverflow, etc., we can perform some basic text processing.
For example, we can count the words in the text (a word is any string delimited by whitespace):

**Exercise** Split the lines into words and count how many words (separated by whitespace) are in the text

**Define** a function that returns the index of the row with the largest number of words and the row itself. Test it on the file above.
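
One possible way to approach both tasks, sketched on a hypothetical list of lines rather than on the file: `str.split()` without arguments splits on any run of whitespace, and `max` with a `key` function finds the longest row. The function name `longest_row` is an assumption; it expects a non-empty list.

```python
def longest_row(lines):
    """Return (index, line) of the line with the most whitespace-separated words."""
    return max(enumerate(lines), key=lambda pair: len(pair[1].split()))

lines = ["to be", "or not to be", "that is"]   # hypothetical data

total_words = sum(len(line.split()) for line in lines)
best_index, best_line = longest_row(lines)
print(total_words, best_index, best_line)
```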

**Define** a function that calculates the frequency of the words

While there exists a built-in implementation that does it in an efficient way (see https://docs.python.org/3/library/collections.html#collections.Counter) you'll develop your own implementation. 

There are several questions that you should ask yourself before writing the code 

* What is the expected output?
* Which Python data structure best fits the need?
* What are the input parameters of the function? More parameters mean more flexibility but also more complexity.
* What are the performance requirements? Should the function offer a low memory footprint or the best execution time?
* Can you compare the performance with collections.Counter?
* How do you deal with capital letters? Stopwords? Punctuation and quotes? (e.g. are "Chair", "chair" and "chair." different words?)
* How do you test the function?
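
To make these questions concrete, here is a minimal sketch of one design: a plain `dict` as the counter, optional lower-casing, and crude punctuation trimming. The function name `word_frequencies`, its `lowercase` parameter and the set of stripped characters are all assumptions; other choices are equally valid.

```python
from collections import Counter

PUNCTUATION = '.,;:!?"\''   # assumed set of characters to trim

def word_frequencies(text, lowercase=True):
    """Count whitespace-separated words with a plain dict."""
    if lowercase:
        text = text.lower()
    counts = {}
    for word in text.split():
        word = word.strip(PUNCTUATION)   # trims both ends only
        if word:
            counts[word] = counts.get(word, 0) + 1
    return counts

sample = "Chair, chair chair. Table"
freqs = word_frequencies(sample)

# Cross-check against the standard-library implementation
reference = Counter(w.strip(PUNCTUATION) for w in sample.lower().split())
print(freqs)
```

Comparing against `collections.Counter` on a small sample is also a cheap way to test the function.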


**Exercise** Calculate the number of words containing "cert"

266

**Exercise** In the work entitled "ALLS WELL THAT ENDS WELL", separate the text of the different acts (or scenes)

**Exercise** In the work "ALLS WELL THAT ENDS WELL", parse the dialogue by character (e.g. all the lines that LAFEU says)

**Exercise** Calculate TF/IDF for acts and scenes throughout all of Shakespeare's works (without using specialized modules for now)
* use one of the previous exercises to split the works into small parts (acts, scenes, etc.)
* treat every such part as a separate document
* optional: remove stop words such as "the", "a", etc. (find a list on the internet)
* calculate the TF/IDF
    * Term Frequency (TF) = number of times a term appears in a document / total number of words in that document
    * Inverse Document Frequency (IDF) = 1 + log(total number of documents / number of documents containing the term)
* compare the most important words for different documents
* judging by the words their parts use, how close are the different works of Shakespeare to each other?
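
The TF and IDF definitions above can be turned into code roughly as follows. This is a sketch under stated assumptions: documents are given as a dict mapping a name to a non-empty list of words, the function name `tf_idf` is hypothetical, and the two tiny "scenes" are made-up data.

```python
from math import log

def tf_idf(documents):
    """documents: dict mapping a document name to its list of words."""
    n_docs = len(documents)

    # number of documents containing each term
    doc_freq = {}
    for words in documents.values():
        for term in set(words):
            doc_freq[term] = doc_freq.get(term, 0) + 1

    scores = {}
    for name, words in documents.items():
        scores[name] = {}
        for term in set(words):
            tf = words.count(term) / len(words)
            idf = 1 + log(n_docs / doc_freq[term])
            scores[name][term] = tf * idf
    return scores

docs = {
    "scene1": "the king speaks".split(),
    "scene2": "the queen listens".split(),
}
scores = tf_idf(docs)
print(scores["scene1"])
```

A term that appears in every document gets IDF = 1, so its score reduces to its term frequency, while rarer terms are weighted up.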

### Exercise
from "Python for Software Design", Allen B. Downey

**Two words** are "rotate pairs" if you can rotate one of them and get the other.

Write a function that reads a wordlist and finds all the rotate pairs
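
The sketch below assumes that "rotate" means shifting every letter by the same number of places in the alphabet (a Caesar shift). If a cyclic rotation of the characters is intended instead, the helper would need to change. The function names and the five-word list are hypothetical, and only lowercase a-z words are handled.

```python
def rotate_word(word, shift):
    """Shift each lowercase letter by `shift` places, wrapping after 'z'."""
    return "".join(
        chr((ord(c) - ord("a") + shift) % 26 + ord("a")) for c in word
    )

def find_rotate_pairs(words):
    """Return (word, rotated_word, shift) triples found in `words`."""
    known = set(words)
    pairs = []
    for word in sorted(known):
        for shift in range(1, 26):
            rotated = rotate_word(word, shift)
            # word < rotated reports each pair only once
            if rotated in known and word < rotated:
                pairs.append((word, rotated, shift))
    return pairs

pairs = find_rotate_pairs(["cheer", "jolly", "melon", "cubed", "apple"])
print(pairs)
```

Storing the word list in a `set` keeps each membership test cheap, which matters once the list has many thousands of words.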

# Working with encodings

* From the Internet Dictionary Project ( http://www.june29.com/idp/IDPfiles.html ) download the French, German, Italian and Portuguese dictionaries
* read these files into Python (check the encoding)
* replace "a\" by "à", "e/" by "é", "e\" by "è", etc.
* create a translation function, like: `translateWord("go","french") # "aller" `
* translate a short story
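
The steps above can be sketched on a tiny stand-in file. Everything about the file is an assumption made for illustration: the tab-separated "english, translation" layout, the `latin-1` encoding and the two sample entries. The real dictionary files should be inspected before reusing any of it. The names `unescape` and `translate_word` are hypothetical.

```python
import os
import tempfile

# Hypothetical entries: english<TAB>translation, accents written as ASCII escapes
sample = "already\tde/ja\\\nschool\te/cole\n"

with tempfile.NamedTemporaryFile("w", encoding="latin-1", suffix=".txt",
                                 delete=False) as tmp:
    tmp.write(sample)
    path = tmp.name

ACCENTS = {"a\\": "à", "e/": "é", "e\\": "è"}   # partial mapping, extend as needed

def unescape(text):
    """Replace each ASCII escape sequence by its accented letter."""
    for escaped, accented in ACCENTS.items():
        text = text.replace(escaped, accented)
    return text

dictionary = {}
# The encoding passed to open() must match the file; latin-1 is assumed here
with open(path, mode="r", encoding="latin-1") as f:
    for line in f:
        english, translation = line.rstrip("\n").split("\t")
        dictionary[english] = unescape(translation)
os.remove(path)

def translate_word(word):
    """Return the translation, or None when the word is unknown."""
    return dictionary.get(word)

print(translate_word("school"))
```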


## Pandas first steps

**Run** and analyse the following code example


In [10]:
import pandas as pd
import random

x = pd.Series([1,2,4])

# random generators for the next exercise
print (x)
print(random.randint(1,4))
random.choices(['red','green','other colour'],k=4)

0    1
1    2
2    4
dtype: int64
4


['red', 'green', 'red', 'green']

**Create** a long (at least 10^3 elements) pandas Series of random integers from 0 to 10, indexed by random colours, by extending the code above

**Calculate** average, min, max values

**Remove** duplicate entries such as "red"->1, "red"->1

**Count** occurrences of values and **find** unique values

**Replace** the index by integers from 0 to the Series length

**Calculate** sum, average, min etc values by colour  
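
A possible starting point for the tasks so far, assuming pandas is installed. The seed, the variable names and the choice of statistics are assumptions made for the sketch.

```python
import random
import pandas as pd

random.seed(0)   # fixed seed only to make the sketch repeatable
n = 1000
colours = random.choices(["red", "green", "other colour"], k=n)
values = [random.randint(0, 10) for _ in range(n)]
s = pd.Series(values, index=colours)

print(s.mean(), s.min(), s.max())
print(s.value_counts().head())             # occurrences of each value

# statistics per index label (one row per colour)
by_colour = s.groupby(level=0).agg(["sum", "mean", "min"])

# index replaced by the integers 0..n-1
s_int_index = s.reset_index(drop=True)
print(by_colour)
```

`groupby(level=0)` groups by the index itself, which is what "by colour" means when the colours are the index rather than a column.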

**Create** a second series similar to `x` and concatenate the two

**Save** the series to a file

**Create** a multi index Series where the index is composed of colours and integers, and the values are random numbers from 0 to 1
https://pandas.pydata.org/pandas-docs/stable/advanced.html

**Find** sum by integer and colour indices
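
A minimal sketch of a two-level index and of sums per level, assuming pandas is installed. The level names `colour` and `number`, the seed and the series size are assumptions made for the example.

```python
import random
import pandas as pd

random.seed(1)   # fixed seed only to make the sketch repeatable
n = 20
colours = random.choices(["red", "green"], k=n)
numbers = [random.randint(0, 2) for _ in range(n)]

# one index level per array, with hypothetical level names
index = pd.MultiIndex.from_arrays([colours, numbers],
                                  names=["colour", "number"])
ms = pd.Series([random.random() for _ in range(n)], index=index)

sum_by_colour = ms.groupby(level="colour").sum()
sum_by_number = ms.groupby(level="number").sum()
sum_by_both = ms.groupby(level=["colour", "number"]).sum()
print(sum_by_both)
```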

**Rewrite** Shakespeare TF/IDF calculations with Pandas using multi index for words and documents
