# Workshop Text Classification with Python

## Python Basics

Lara Kobilke

Last edit: 29.09.2020

# Python Core Language

This part teaches you the main variable types in Python. Afterwards you will be able to perform basic mathematical operations. The `x` stands for the input. Sometimes methods have `()` without `x` inside indicating that no input is needed. 

| Variable Type | Method | Example |
|---------------|--------|---------|
|Integers (whole numbers)| int(x) | -1 , 3 , 3229732  |
|Floating point numbers | float(x) | 1.424 , 1.34e-4 = 1.34*10^-4 |
|Logical (Boolean) values|bool(x)   |True or 1, False or 0|
|String	| str(“x”)|“letters and symbols!”|

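Unlike some other programming languages, Python does not require you to specify the data type when defining a variable: the type is inferred from the assigned value. However, you can convert a value to a specific type explicitly with the functions listed in the table above, e.g. `int(x)` or `float(x)`. You can check the type of a variable with the `type()` function.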
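
The conversion functions from the table can be tried out as follows. This is a minimal sketch; the variable names (`as_int`, `as_float`, and so on) are made up for illustration.

```python
# Convert values between the basic types with the functions from the table.
as_int = int("42")     # string -> integer
as_float = float(3)    # integer -> floating point number
as_str = str(1.5)      # floating point number -> string
as_bool = bool(0)      # 0 becomes False, any other number becomes True

# type() shows which type each result has.
print(as_int, type(as_int))
print(as_float, type(as_float))
print(as_str, type(as_str))
print(as_bool, type(as_bool))
```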

_Hint:_ On Windows, you can run a cell by clicking on it and pressing `Ctrl (German: Strg) + Enter`. On a Mac, use `Cmd + Enter`.

In [None]:
x = -1
x

In [None]:
type(x)

In [None]:
x = 1.424
x

In [None]:
type(x)

In [None]:
x = True
x

In [None]:
type(x)

In [None]:
x = bool(0)
x

In [None]:
x = 'True' # Notice: You must use '' or "" to create a string and not a bool!
x

In [None]:
type(x)

### Mathematical Operators

|Operator	| Python Symbol |	Example |
|-----------|---------------|-----------|
|Addition|	+	|Input: 5+2<br>Output: 7|
|Subtraction|	-	|5-2<br>3|
|Multiplication	|*	|5*2<br>10|
|Float Division	|/	|5/2<br>2.5|
|Integer Division	|// |5//2<br>2|
|Exponentiate	|** |	5**2<br>25|
|Modulo	|%|	5%2<br>1|


__Example:__

In [None]:
a = 23
b = 4
c = a + b

In [None]:
c

In [None]:
type(a)

In [None]:
type(b)

In [None]:
type(c)
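
The other operators from the table work the same way. The short sketch below uses two made-up variables, `x` and `y`, to show the difference between float division, integer division and modulo.

```python
x = 5
y = 2

float_division = x / y      # always returns a float
integer_division = x // y   # rounds down to the next whole number
remainder = x % y           # what is left over after integer division
power = x ** y              # x to the power of y

print(float_division, integer_division, remainder, power)

# Integer division and modulo belong together:
print(integer_division * y + remainder == x)
```

Note that `//` always rounds down, so `-5 // 2` gives `-3` and not `-2`.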

### Assigning Variables in Python

Python allows you to reassign a variable with itself:  

In [None]:
b = b + 4
print(b)

In [None]:
b += 2
print(b)

Variable names are case sensitive and can contain any ASCII letter, the underscore (_) and any digit. A name cannot start with a digit, and operators cannot be part of a variable name. Furthermore, some words are reserved keywords in Python and cannot be used as variable names either:

![grafik.png](attachment:grafik.png)
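
If the image is not displayed, you can also ask Python for its reserved keywords. The sketch below uses the `keyword` module from the standard library; the tested names (`my_var`, `2nd_value`) are made up for illustration.

```python
import keyword

# All reserved keywords of the Python version you are running
print(keyword.kwlist)

# Check single names
print(keyword.iskeyword("for"))      # a reserved keyword
print(keyword.iskeyword("my_var"))   # not a keyword, so it can be a variable name

# str.isidentifier() checks the naming rules (letters, digits, underscore, no leading digit)
print("my_var".isidentifier())
print("2nd_value".isidentifier())
```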

### Comparison and Logical Operators

|Operator|	Python Symbol|	Example|
|--|--|--|
|Equal to |	==	|Input: 5 == 3 <br> Output: False<br><br> Input: “abc” == “abc” <br> Output: True|
|Not equal to	|!=	|5 != 3<br>True|
|Greater than|	>	|5 > 3<br>True|
|Less than|	<	|5 < 3 <br>False|
|Greater than or equal to|	>=	|5 >= 3 <br>True|
|Less than or equal to|	<=	|5 <= 3 <br>False|
|And operator	|and|	5 > 3 and -1 < 0<br>True|
|Or operator|	or	|5 > 3 or -1 > 0<br>True|
|Not operator|	not|	not (5 < 6 or 5 == 5)<br>False|
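
The logical operators from the table can be checked directly in Python. This is a small sketch with made-up variable names:

```python
both_true = 5 > 3 and -1 < 0        # True and True
either_true = 5 > 3 or -1 > 0       # True or False
negated = not (5 < 6 or 5 == 5)     # not (True or True)

print(both_true)
print(either_true)
print(negated)
```

`and` is only `True` if both sides are `True`, while `or` is already `True` if one side is `True`.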


## Lists and Dictionaries

Python has four types of Collections with unique features:
  * `List`: a collection of data entries which is ordered and changeable. Lists allow duplicate members.
  * `Tuple`: a collection of data entries which is ordered and unchangeable. Tuples allow duplicate members.
  * `Set`: a collection of data entries which is unordered and unindexed. Sets do not allow duplicate members.
  * `Dictionary`: a collection of key-value pairs which is changeable and indexed by key. Since Python 3.7, dictionaries keep the insertion order of their entries. Dictionaries do not allow duplicate keys.
  
In this tutorial, you will learn how to work with `lists` and `dictionaries`.
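
Before focusing on lists and dictionaries, here is a brief sketch of how the four collection types differ. All variable names and values are made up for illustration.

```python
my_list = [1, 2, 2, 3]               # ordered, changeable, duplicates allowed
my_tuple = (1, 2, 2, 3)              # ordered, unchangeable, duplicates allowed
my_set = {1, 2, 2, 3}                # duplicates are removed
my_dict = {"a": 1, "b": 2, "a": 3}   # a repeated key overwrites the earlier value

my_list[0] = 10                      # works: lists are changeable
try:
    my_tuple[0] = 10                 # fails: tuples are unchangeable
except TypeError as error:
    print("Tuples cannot be changed:", error)

print(my_list)
print(my_set)
print(my_dict)
```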

__Lists:__

In Python, lists are written with square brackets `[]`.

In [None]:
a = list([1,2,3,4,5]) # You can create lists using the list() function
a

In [None]:
a = [1,2,3,4,5] # Alternatively, you can just use []
a

In [None]:
a[2] = 10 # You can change list entries by referring to their index, starting at 0
a

__Dictionaries:__

In Python, dictionaries are written with curly brackets `{}`, and they have keys and values.

In [None]:
population_dict = {'Zurich': 402.762,
                   'Bern': 1035.000,
                   'Luzern': 81.592,
                   'Geneva': 499.480,
                   'Basel': 171.017}
population_dict

In [None]:
population_dict['Zurich'] # You can access the items of a dictionary by referring to its key name, inside square brackets.

In [None]:
population_dict['Zurich'] = 2000.000  # You can change dict entries by referring to their key name
population_dict

## Memory Management 

When an object is created, like `x = 3`, it is assigned a specific location in the memory of a PC. The location is called identity and can be accessed with the function `id()`. Python distinguishes between immutable and mutable objects.

In [None]:
a = 3
id(a)

In [None]:
a = a +4 
id(a)

Although the name `a` was used both times, the id is different. Integers are immutable objects: the value `3` is not overwritten in place. Instead, `a + 4` creates a new object at another memory location, and the name `a` is reassigned to it.

In [None]:
a = list([1,2,3,4,5])
a

In [None]:
id(a)

In [None]:
a[2] = 10
a

In [None]:
id(a)

As you can see, lists are mutable objects. Their `id` does not change when an entry is modified.

This is also true for dictionaries. Their `id` does not change either:

In [None]:
a = {1: 'Anna', 2: 'Basti', 3: 'Carmen'}
print(a)
print(id(a))
a[3] = 'Carsten'
print(a)
print(id(a))
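
One practical consequence of mutability is that two names can refer to the same object. The following sketch (with made-up variable names) shows that changing a list through one name also changes it for the other name, unless you create a copy first.

```python
numbers = [1, 2, 3]
alias = numbers              # no copy: both names refer to the same list
alias[0] = 99

print(numbers)               # the change is visible through both names
print(alias is numbers)      # 'is' compares the identity of two objects

independent = list(numbers)  # list() creates a new list with the same entries
independent[0] = 0

print(numbers)               # unchanged by the edit to the copy
print(independent is numbers)
```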

You can always reclaim memory by deleting variables:

In [None]:
del a
a # This raises a NameError because the variable a no longer exists

If you are really in need, you can also free memory by importing the `gc` ("garbage collector") module from the standard library:

In [None]:
import gc # You can write the import command at the top of your code. You won't need to call it again.
gc.collect()

## Working with String Objects

### String Operations

A Python `String Object` is an ordered, immutable sequence of characters. Strings are enclosed by single or double quotes:

In [None]:
var1 = "This is a string"
var2 = 'and this is also one'

Strings can be concatenated using the + operator:

In [None]:
var3 = var1 + var2
print(var3)

The strings are just added together without space. It is possible to add space by concatenating “ “: 

In [None]:
var4 = var1 + " " + var2
print(var4)

Strings can be repeated n times using the * operator:

In [None]:
my_abc = 'abc'
my_abc*3

Some character combinations in strings have a special meaning and are called escape sequences. They are introduced by a backslash `\`. For example, `\n` creates a new line and `\t` inserts a tab:

In [None]:
var5 = 'this text\nhas two lines'
print(var5)

Mind that there is no space before and after the `\n`.
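
The tab escape sequence works in the same way. The sketch below also shows a raw string (prefix `r`), which is an addition to the material above: in a raw string, the backslash is kept as a normal character. The variable names are made up.

```python
var6 = 'name\tvalue'
print(var6)
print(len(var6))      # \t counts as a single character

raw = r'name\tvalue'  # raw string: no escape sequences are interpreted
print(raw)
print(len(raw))       # here the backslash and the t are two characters
```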

### Common String Methods

Here is a list of common methods you can use to modify strings. `x` represents a string object, like `x = “test”`. String methods do not replace the original string. 

|Description	|Method|	Example|
|-|-|-|
|Length of a string|	len(x) 	|Input: len(“test 1”)<br> Output: 6|
|Remove leading and trailing whitespaces|	x.strip()|	“ test “.strip()<br>“test”|
|Return copy of string with all characters in uppercase	|x.upper()|	‘test’.upper()<br>‘TEST’|
|Return copy of string with all characters in lowercase|	x.lower()	|‘TeSt’.lower()<br>‘test’|
|Return list of substrings from original string which are separated by the string separator|	x.split(separator)|	See example below|
|Return copy of string with substring **old** replaced by **new**|	x.replace(old, new)|	“SpS”.replace(“S”, “p”)<br>“ppp”|
|in as a logical operator for strings to find a series of characters in a string	|“abc” in x	|“abc” in “1abcd”<br>True|


In [None]:
"Ein grosser Baum".split(" ")

More methods for strings can be found here: https://www.w3schools.com/python/python_ref_string.asp
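
The methods from the table can be combined, and none of them changes the original string. A minimal sketch with made-up variable names:

```python
text = "  Ein grosser Baum  "

cleaned = text.strip()                      # remove leading and trailing whitespace
words = cleaned.split(" ")                  # list of substrings
shouted = cleaned.upper()                   # copy in uppercase
replaced = cleaned.replace("Baum", "Wald")  # copy with a replaced substring

print(words)
print(shouted)
print(replaced)
print("gross" in cleaned)                   # substring check
print(repr(text))                           # the original string is unchanged
```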

### Indexing and Slicing

A string consists of n characters and is indexed from 0 to n-1:

|Index|0|1|2|3|
|-|-|-|-|-|
|String|I|K|M|Z|
|Position|1|2|3|4|
|Reverse index|	-4|	-3|	-2|	-1|

You can return a character in a string by referring to its index and using [ ]:

In [None]:
a = 'IKMZ'
print(len(a))
print(a[0])
print(a[len(a)-1])
print(a[-1])

The function `len()` returns the number of characters in the string. Therefore, “IKMZ” has a length of 4. Because indexing starts at 0, the first character, “I”, has the index 0 and the last character has the index `len(a) - 1`. Another way to refer to the last character of a string is the reverse index `-1`.

Slicing produces a substring of a given string between two indices, including index `i` and excluding index `j`: `a[i:j]`. So when `a = 'IKMZ'`, then:

In [None]:
a[1:3]

In [None]:
a[:2]

In [None]:
a[1:]

By omitting the `i` or `j` index, you can include all characters before or after the given index.
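
The slicing rules can be summarized in a short sketch. It uses a made-up variable `word` and additionally shows that reverse indices work in slices, too.

```python
word = 'IKMZ'

middle = word[1:3]     # index 1 is included, index 3 is excluded
start = word[:2]       # everything before index 2
rest = word[1:]        # everything from index 1 onwards
last_two = word[-2:]   # reverse indices can be used in slices as well

print(middle, start, rest, last_two)

# Two slices that share the same boundary add up to the original string
print(word[:2] + word[2:] == word)
```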







# Using Pandas

This part teaches you to use the Pandas package in Python. Afterwards you will be able to load data from `csv-files`, to set your working directory and to perform aggregated data analysis. The Pandas package behaves very similarly to R, e.g. it allows you to work with data frames, to attach labels to data, to work with missing data, etc.

## Load Pandas Package

_Remember:_ You should always load the package first, otherwise the code will provoke an error message. You can write the `import` command in the first line of your code because you don't need to run it a second time.

In [None]:
import pandas as pd

## Introduction to `Pandas Objects`

Pandas offers two object types:
* `Series` object: A one-dimensional array of indexed data
* `DataFrame` object: A two-dimensional array with both flexible row indices and flexible column names

These objects behave similarly to dataframes in R: The rows and columns are identified with labels rather than simple integer indices.



## The Pandas `Series` Object

A Pandas `Series` object is a one-dimensional array of indexed data. By default, a `Series` object is indexed with integer indices.

In [None]:
data1 = pd.Series([50,100,200])
data1

But you can also _explicitly_ define an index for a `Series` object:

In [None]:
data2 = pd.Series([50,100,200], index=["a","b","c"])
data2

You can access the index labels by using the `index` attribute:

In [None]:
data2ind = data2.index
data2ind

You can also create a `Series` object directly from a Python dictionary:

In [None]:
data_dict = pd.Series({"c":123,"a":30,"b":100})
data_dict

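Notice: In current Pandas versions (0.23 and later, running on Python 3.6 or later), the index of the `Series` keeps the insertion order of the dictionary keys. Older versions sorted the keys instead.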
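
Because the order of the index can depend on the Pandas version, it is safer to sort explicitly when you need sorted labels. The sketch below uses the `sort_index()` method; the variable names are made up.

```python
import pandas as pd

prices = pd.Series({"c": 123, "a": 30, "b": 100})
print(list(prices.index))          # order as created by your Pandas version

sorted_prices = prices.sort_index()  # returns a new Series sorted by index label
print(list(sorted_prices.index))
```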

## The Pandas `DataFrame` Object

A `DataFrame` object is a two-dimensional array with flexible row indices and flexible column names.
 * Both the rows and columns have a generalized index for accessing the data.
 * The row indices can be accessed by using the `index` attribute.
 * The column indices can be accessed by using the `columns` attribute.
 

### Constructing `DataFrame` Objects

You can think of a `DataFrame` as a sequence of aligned `Series` objects, meaning that each column of a `DataFrame` is a `Series`.

__Introductory Example:__

You should first create two dictionaries from which you draw two `Series` Objects:

In [None]:
population_dict = {'Zurich': 402.762,
                   'Bern': 1035.000,
                   'Luzern': 81.592,
                   'Geneva': 499.480,
                   'Basel': 171.017}
population_series = pd.Series(population_dict)

area_dict = {'Zurich': 87.88, 'Bern': 51.62, 'Luzern': 29.11, 'Geneva': 15.92, 'Basel': 23.85}
area_series = pd.Series(area_dict)

Now you can look at the two `Series` Objects and their values:

In [None]:
population_series

In [None]:
area_series

Finally, you can create a `DataFrame` Object from these two `Series` Objects:

In [None]:
swiss_states = pd.DataFrame({'population': population_series,'area': area_series})
swiss_states

This final product is your `DataFrame` with labeled rows and columns!

You can now look at the labels of rows and columns separately:

In [None]:
swiss_states.index

In [None]:
swiss_states.columns

_Notice:_ There are multiple ways to construct a `DataFrame` object.
 1. From a single `Series` object:

In [None]:
pd.DataFrame(population_series, columns=["population"])

 2. From a list of dictionaries:

In [None]:
pd.DataFrame([{'Jonas': 1.0, 'Werner': 2.0, 'Hannah': 1.0}, {'Werner': 3.0, 'Lisa': 2.0, 'Hannah': 3.0}])

 3. From a dictionary of `Series` objects:

In [None]:
pd.DataFrame({'population': population_series, 'area': area_series})

## Data Selection in `Series`

`Series` can be used as a dictionary:
 * Select elements by key, e.g. `data['Werner']`
 * Modify the `Series` object with familiar syntax, e.g. `data['Lisa'] = 100`
 * Check if a key exists by using the `in` operator
 * Access all the keys by using the `keys()` method
 * Access all the key-value pairs by using the `items()` method

In [None]:
grades = pd.Series([1.0, 2.0, 1.0], index=['Jonas', 'Werner', 'Hannah'])
grades

In [None]:
grades['Werner']

In [None]:
grades['Hannah'] = 3.0
grades

In [None]:
'Hannah' in grades

In [None]:
grades.keys()

In [None]:
list(grades.items())

`Series` can also be used as one-dimensional array: 
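 * You can select elements by using their implicit integer index, e.g. `grades.iloc[0]` (the shorthand `grades[0]` is deprecated in recent Pandas versions when the index consists of labels)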
 * You can select elements by using their explicit index, e.g. `grades['Hannah']`
 * You can select slices both by using an implicit integer index or an explicit index
   * _Notice:_ Slicing with an explicit index (e.g., `grades['Jonas':'Werner']`) will _include_ the final index in the slice, while slicing with an implicit index (e.g., `grades[0:1]`) will _exclude_ the final index from the slice
 * You can use masking operations, e.g., `grades[grades < 2]`
   * _Notice:_ You can combine masking operations, e.g. `grades[(grades > 1) & (grades < 3)]`

In [None]:
grades.iloc[0] # Selecting by implicit index (plain grades[0] is deprecated in recent Pandas versions)

In [None]:
grades['Hannah'] # Selecting by explicit index

In [None]:
grades[0:1] # Slicing by implicit index

In [None]:
grades['Jonas':'Werner'] # Slicing by explicit index

In [None]:
grades[grades < 2]

In [None]:
grades[(grades > 1) & (grades < 3)]

## Data Selection in `DataFrame`

A `DataFrame` can be used as a dictionary of related `Series` objects: 
 * Select `Series` by the column name, e.g. `data['area']`
 * Modify the `DataFrame` object with mathematical operators, e.g. `data['density'] = data['population'] / data['area']`

In [None]:
population_series = pd.Series({'Zurich': 402.762,
                   'Bern': 1035.000,
                   'Luzern': 81.592,
                   'Geneva': 499.480,
                   'Basel': 171.017})

area_series = pd.Series({'Zurich': 87.88, 'Bern': 51.62, 'Luzern': 29.11, 'Geneva': 15.92, 'Basel': 23.85})

data = pd.DataFrame({'area': area_series, 'population': population_series})
data

In [None]:
data['area']

In [None]:
data['population']

In [None]:
data['density'] = data['population'] / data['area']
data

A `DataFrame` can be used as a two-dimensional array: 
 * Access the underlying matrix by using the `values` attribute
   * `data.values[0]` will select the first row
 * Use the `iloc` indexer to index, slice, and modify the data by using the implicit integer index
 * Use the `loc` indexer to index, slice, and modify the data by using the explicit index

In [None]:
data.values

In [None]:
data.values[0]

In [None]:
data.iloc[:3, :2] # Use implicit, ordinal indices

In [None]:
data.loc[:'Luzern', :'population'] # Use explicit, labeled indices

In [None]:
data.iloc[0, 2] = 120
data

## Missing Data

In Pandas, missing data is marked with `NaN` ("Not a Number"). You can create `NaN` values by loading the `NumPy` package and using the `np.nan` constant. You can use the `dropna()` method to drop missing values, and you can define whether you want to drop the rows or the columns by using the `axis` parameter.

In [None]:
import numpy as np # Loading the NumPy package
data.iloc[0, 2] = np.nan # Assigning the np.nan constant
data

In [None]:
data.dropna(axis='rows') # Drop the rows where at least one element is missing

In [None]:
data.dropna(axis='columns') # Drop the columns where at least one element is missing
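
The following self-contained sketch repeats the idea on a small, made-up `DataFrame` (the numbers are for illustration only). It also shows that `dropna()` returns a new object and leaves the original untouched, and it uses `isna()` to count the missing values first.

```python
import numpy as np
import pandas as pd

# Small made-up example with one missing value
df = pd.DataFrame({"area": [87.88, 51.62, 29.11],
                   "density": [np.nan, 20.05, 2.80]},
                  index=["Zurich", "Bern", "Luzern"])

missing_per_column = df.isna().sum()      # count the missing values per column
rows_kept = df.dropna(axis="index")       # drop rows with missing values
columns_kept = df.dropna(axis="columns")  # drop columns with missing values

print(missing_per_column)
print(rows_kept)
print(columns_kept)
print(df.shape)                           # the original DataFrame is unchanged
```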

## Reading and Writing Data with Pandas
 
### The Current Working Directory

Every program that runs on your computer has a _current working directory_ . It's the directory (more modern: _folder_ ) from where the program is executed.
The _root_ directory is the top-most directory and is addressed by `/` 
 * A directory `mydir1` in the root directory can be addressed by `/mydir1`
 * A directory `mydir2` within the `mydir1` directory can be addressed by `/mydir1/mydir2`, and so on
 
### Absolute and Relative Paths

An _absolute path_ always begins with the root folder, e.g. `/my/path/...`. A _relative path_ is always relative to the program's current working directory. 

__Example:__ 
If a program's current working directory is `/myprogram` and the directory contains a folder `files` with a file `test.txt`, then the relative path to that file is just `files/test.txt`. The absolute path to `test.txt` would be `/myprogram/files/test.txt` (note the root folder `/`).
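
The relation between the two kinds of paths can be sketched with the `pathlib` module from the standard library. The sketch only works with path names from the example; the folders do not need to exist on your computer. `PurePosixPath` is used here so that the `/` notation behaves the same on every operating system.

```python
from pathlib import PurePosixPath

working_directory = PurePosixPath("/myprogram")
relative_path = PurePosixPath("files/test.txt")

# Joining the working directory and the relative path gives the absolute path
absolute_path = working_directory / relative_path

print(absolute_path)
print(relative_path.is_absolute())
print(absolute_path.is_absolute())
```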
 
### Set and Change your Working Directory

You can see, set, and change your working directory. In a Jupyter notebook, `pwd`, `ls` and `mkdir` are available as shortcuts (IPython "magic" commands and aliases):
  * `pwd`: Print the working directory (where we currently are in the file system)
  * `ls`: List working directory contents
  * `import os` followed by `os.chdir("C:/path/to/your/location")`: Change directory
  * `mkdir`: Make new directory

In [None]:
pwd

In [None]:
# This command changes your working directory.
# You will need to remove the #-sign in the next lines of code, put your own file path in the "C:/path/to/your/location", 
# and then run the following code:

# import os
# os.chdir("C:/path/to/your/location")

In [None]:
mkdir test

In [None]:
ls
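
The shortcuts above only work inside a notebook. In a plain Python script, the `os` module offers equivalent functions. The sketch below is one possible way to try them out; it works inside a temporary folder (created with `tempfile`) so that nothing on your computer is changed permanently.

```python
import os
import tempfile

start_directory = os.getcwd()              # like pwd

with tempfile.TemporaryDirectory() as scratch:
    os.chdir(scratch)                      # change into the temporary folder
    os.makedirs("test", exist_ok=True)     # like mkdir test
    contents = os.listdir()                # like ls
    print(contents)
    os.chdir(start_directory)              # go back before the folder is deleted

print(os.getcwd() == start_directory)
```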

### Reading Data with Pandas

Pandas provides the `pandas.read_csv()` function to load data from a CSV file. 

Download the CSV file `Tweets_by_Sentiment_DataFrame` from the link: https://drive.switch.ch/index.php/s/r932ztOQdtqBX57,       Password: _css_methods_

Store the data set in the same folder as this Notebook.

In [None]:
tweets = pd.read_csv("Tweets_by_Sentiment_DataFrame.csv", sep=",")

In [None]:
# If the file is not in the same folder as this Notebook and you get a FileNotFoundError, your directory might not 
# be set up correctly.
# In this case, you should manually set the working directory to the folder in which the csv is stored.
# You will need to remove the #-sign in the next lines of code, put your own file path in the "C:/path/to/your/location", 
# and then run the following code:

# import os
# os.chdir("C:/path/to/your/location")
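
If you want to see how `pd.read_csv()` behaves before downloading the file, you can feed it a small piece of CSV text. This is only a sketch: the column names `text` and `sentiment` and the two rows are invented and are not the content of the workshop data set.

```python
import io
import pandas as pd

# A tiny, made-up CSV: the first line contains the column names
csv_text = "text,sentiment\nI love this,positive\nNot my thing,negative\n"

# io.StringIO lets read_csv treat the string like a file
example = pd.read_csv(io.StringIO(csv_text), sep=",")

print(example.shape)
print(example.columns.tolist())
print(example)
```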

In [None]:
tweets.shape
# Number of rows/cases and columns

In [None]:
len(tweets)
# Number of rows/cases

In [None]:
len(tweets.columns)
# Number of columns

In [None]:
tweets.columns
# Gives the names of the columns

In [None]:
tweets.head()
# Gives the head (5 rows by default) of the data frame.

In [None]:
tweets.isnull().sum()
# Shows the count of missings in each column

## Aggregating and Grouping Data in Pandas

### Simple Aggregation in Pandas

For a `Series`, the aggregates return a single value. For a `DataFrame`, the aggregates return one result for each column.

__Series:__

In [None]:
area_series.sum()

In [None]:
area_series.mean()

In [None]:
area_series.max()

In [None]:
area_series.min()

__DataFrame:__

In [None]:
data.sum()

In [None]:
data.mean()

In [None]:
data.max()

In [None]:
data.min()

In [None]:
data.describe()

### Append & Concatenate DataFrames

You can create a single `DataFrame` from multiple `DataFrames` by using `append()` and `concat()`:
  * The functions `append()` and `concat()` are almost equivalent
  * `concat()` provides the flexibility to join based on the axis (all rows or all columns)
  * `append()` is the specific case (`axis=0, join='outer'`) of `concat()`
  * `DataFrame.append()` was deprecated in Pandas 1.4 and removed in Pandas 2.0, so in current versions only `pd.concat()` is available

__Let's first define three DataFrames:__

In [None]:
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                    'B': ['B0', 'B1', 'B2', 'B3'],
                    'C': ['C0', 'C1', 'C2', 'C3'],
                    'D': ['D0', 'D1', 'D2', 'D3']},
                    index=[0, 1, 2, 3])
df1

In [None]:
df2 = pd.DataFrame({'A': ['A4', 'A5', 'A6', 'A7'],
                    'B': ['B4', 'B5', 'B6', 'B7'],
                    'C': ['C4', 'C5', 'C6', 'C7'],
                    'D': ['D4', 'D5', 'D6', 'D7']},
                   index=[4, 5, 6, 7])
df2

In [None]:
df3 = pd.DataFrame({'A': ['A8', 'A9', 'A10', 'A11'],
                    'B': ['B8', 'B9', 'B10', 'B11'],
                    'C': ['C8', 'C9', 'C10', 'C11'],
                    'D': ['D8', 'D9', 'D10', 'D11']},
                    index=[8, 9, 10, 11])
df3

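__Append them (Pandas versions before 2.0 only):__

In [None]:
df_all = df1.append(df2) # Raises an AttributeError in Pandas 2.0 and later; use pd.concat([df1, df2]) there
df_all

In [None]:
df_all = df1.append([df2, df3]) # In Pandas 2.0 and later: pd.concat([df1, df2, df3])
df_all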

__Concatenate them:__

In [None]:
data_all = pd.concat([df1, df2])
data_all

In [None]:
data_all = pd.concat([df1, df2, df3])
data_all

* You can also define the axis you want to concatenate to:

In [None]:
df_all = pd.concat([df1, df2, df3], axis=0, join='outer')
df_all

In [None]:
df_all = pd.concat([df1, df2, df3], axis=1, join='outer')
df_all
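
When the row labels of the inputs overlap, `pd.concat()` keeps them as they are, which can lead to duplicate labels. The sketch below uses two small, made-up frames (`left` and `right`) and the `ignore_index` parameter to renumber the rows.

```python
import pandas as pd

left = pd.DataFrame({"A": ["A0", "A1"]}, index=[0, 1])
right = pd.DataFrame({"A": ["A2", "A3"]}, index=[0, 1])

stacked = pd.concat([left, right])                         # keeps the labels 0, 1, 0, 1
renumbered = pd.concat([left, right], ignore_index=True)   # new labels 0, 1, 2, 3
side_by_side = pd.concat([left, right], axis=1)            # aligns rows by label

print(stacked)
print(renumbered)
print(side_by_side)
```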

### The `GroupBy` Object
* The `groupby()` method returns a `DataFrameGroupBy` object: It's a special view of the `DataFrame`
* It helps get information about the groups, but does no actual computation until the aggregation is applied ("lazy evaluation", i.e. evaluate only when needed)
* Apply an aggregate to this `DataFrameGroupBy` object: This will perform the appropriate apply/combine steps to produce the desired result
* Other important operations made available by a `GroupBy` are _filter_ , _transform_ , and _apply_

In [None]:
names = pd.DataFrame({'name': ['Anna, Jörkel', 'Basti, Hörhammer', 'Caren, Liebeskind', 'Anna, Jörkel', 'Basti, Rittburger', 'Caren, Liebeskind'], 'values': range(1,7)})
names

In [None]:
groupby_key = names.groupby('name')
groupby_key.groups

In [None]:
names.groupby('name').sum()

### Column Indexing and Iterating Over Groups

The `GroupBy` object supports column indexing in the same way as the `DataFrame`, and returns a modified `GroupBy` object. The `GroupBy` object also supports direct iteration over the groups, returning each group as a Series or `DataFrame`.

In [None]:
names.groupby('name')

In [None]:
names.groupby('name')['values']

In [None]:
names.groupby('name')['values'].sum()
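
Direct iteration over the groups can look like the sketch below. It rebuilds the `names` data from above so that it runs on its own; the dictionary `group_sizes` is a made-up helper for collecting the results.

```python
import pandas as pd

names = pd.DataFrame({'name': ['Anna, Jörkel', 'Basti, Hörhammer', 'Caren, Liebeskind',
                               'Anna, Jörkel', 'Basti, Rittburger', 'Caren, Liebeskind'],
                      'values': range(1, 7)})

group_sizes = {}
# Each step of the loop returns the group key and the DataFrame of that group
for name, group in names.groupby('name'):
    group_sizes[name] = len(group)
    print(name, group['values'].tolist())

print(group_sizes)
```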

### Splitting the DataFrame by Groups

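You can extract the `DataFrame` of a single group by first converting the `GroupBy` object into a list. Each list entry is a tuple of the group key and the group's `DataFrame`: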

In [None]:
grouped = list(names.groupby(['name']))
grouped

In [None]:
joerkel_dataframe = grouped[0][1] # DataFrame for first group
joerkel_dataframe

In [None]:
hoerhammer_dataframe = grouped[1][1] # DataFrame for second group
hoerhammer_dataframe

In [None]:
joerkel_dataframe['values'].sum()
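
As an alternative to the list conversion, the `GroupBy` object also has a `get_group()` method that returns the `DataFrame` of one group by its key. A minimal sketch, again rebuilding the `names` data so that it runs on its own (the variable names `joerkel` and `total` are made up):

```python
import pandas as pd

names = pd.DataFrame({'name': ['Anna, Jörkel', 'Basti, Hörhammer', 'Caren, Liebeskind',
                               'Anna, Jörkel', 'Basti, Rittburger', 'Caren, Liebeskind'],
                      'values': range(1, 7)})

# Select one group directly by its key instead of by its position in a list
joerkel = names.groupby('name').get_group('Anna, Jörkel')
total = joerkel['values'].sum()

print(joerkel)
print(total)
```

Selecting by key does not depend on the order of the groups, which makes the code easier to read than `grouped[0][1]`.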