# Python for Psychologists - Session 3
## from Session2: sets  
## later: handling data with dataframes & pandas

### Sets
Sets are *unordered* and *unindexed* collections of things. Sets also differ from lists in that they cannot contain two identical elements. Sets can be created with the following syntax:

```python
my_set = {"element1", "element2", 4}
```

Create a set of three random inputs.

In [65]:
my_set = {"hallo", 2, 9}
my_set

{'hallo', 2, 9}

Now try to create a set with a duplicate value. Take a look at the resulting set afterwards.

In [67]:
my_set = {"hallo", 2, 9, 9, 1}
my_set

{'hallo', 2, 9, 1}

Try to access the first element of your set.

In [54]:
my_set[0]

TypeError: 'set' object does not support indexing

We can also convert lists to sets and vice versa:

```python
my_list = [1,2,3,4]
my_set = set(my_list)
my_list = list(my_set)
```

Take the following list, convert it to a set and back to a list again.

In [59]:
my_list = ["ich", "mag", "sets"]

In [60]:
my_set = set(my_list)
my_set

{'ich', 'mag', 'sets'}

In [73]:
new_list = list(my_set)  # convert the set (not the original list) back; element order may differ, since sets are unordered
new_list

['ich', 'mag', 'sets']
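
A common use of this round trip is removing duplicates from a list. The following is only a small sketch with a made-up list `words` (not part of the exercise); because sets are unordered, it uses `sorted()` to get the result back in a predictable order.

```python
# Hypothetical example list containing duplicates
words = ["ich", "mag", "sets", "ich", "sets"]

unique_words = set(words)            # converting to a set drops the duplicates
unique_list = sorted(unique_words)   # sorted() returns a list in a defined order

print(unique_list)
```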

**Adding items to a set**

We can add single items to a set using the following syntax:

```python
my_set.add(new_item)
```

We can also add several items at once:

```python
my_set.update({new_item1, new_item2})
```

Try to add some items to my_set.

In [64]:
my_set.update({"und", "Listen", "auch"})
my_set

{'Listen', 'auch', 'ich', 'mag', 'sets', 'und'}

**Removing items from a set**

Removing items from a set works similarly:

```python
my_set.remove(item_to_be_removed)
```

Delete one of the items that you have just added.

In [147]:
my_set.remove('Listen')
my_set

{'auch', 'ich', 'mag', 'sets', 'und'}
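
One detail worth knowing: `remove` raises a `KeyError` if the item is not in the set, whereas `discard` silently does nothing in that case. A minimal sketch, using an illustrative set `fruit_set` that is not part of the exercise above:

```python
fruit_set = {"apple", "banana"}   # illustrative set

fruit_set.discard("cherry")       # not in the set: no error, nothing happens

try:
    fruit_set.remove("cherry")    # not in the set: raises KeyError
except KeyError:
    print("'cherry' is not in the set")

fruit_set.remove("apple")         # in the set: removed as expected
print(fruit_set)
```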

### Handling data

In the last two sessions you learned about the basic principles, data types and variables and how to handle them. But **most** of the time we do not work with single lists, tuples or the like, but with a bunch of data arranged in logfiles, tables, .csv files ...


Today we learn about using **pandas** to ... 

![pandasUrl](https://media.giphy.com/media/fAaBpMgGuyf96/giphy.gif "pandas")



... well to actually handle our data. **Pandas** is your data's home and we can get familiar with our data by cleaning, transforming and analyzing it through pandas.

For getting started, we need to `import pandas as pd ` to use all its provided features. We use ***pd*** as an abbreviation, since we are a bit lazy here :) 

In [1]:
import pandas as pd 

Pandas has two core components: ***series*** and ***dataframes***. A series is basically a single column, whereas a dataframe is a "multi-dimensional table" made out of a collection of series. Both can contain different kinds of data types - for now we will use integers.

----------
**creating series**

To create a series from any elements, we can use:

```python
s = pd.Series([element1, element2, element3], name="anynameyouwant")
```

Now try to create two series named after your two favorite fruits, each containing 6 random integers, and check one of them:


In [2]:
s1 = pd.Series([3,4,7,8,4,1], name="apples")

s2 = pd.Series([5,9,12,2,9,10], name="bananas")

s1

0    3
1    4
2    7
3    8
4    4
5    1
Name: apples, dtype: int64

As we can see, there is one column (as described above) containing the assigned values, but wait .. why is there another column? 

The first column contains the index. In our case we just used the pandas default, which again starts at 0 (remember why?). Consequently, we can again use ```series[1]``` to index the second value (row) in our series. 

Try to index the last element in one of your fruit series and think about what's different compared to indexing e.g. lists!

In [3]:
s2[5]

10
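
One difference from lists: with the default integer index, `series[...]` looks up the index *label*, not the position, so a negative index such as `s2[-1]` does not work. The sketch below recreates `s2` from above and uses `.iloc` for position-based access.

```python
import pandas as pd

s2 = pd.Series([5, 9, 12, 2, 9, 10], name="bananas")

last_by_label = s2[5]            # label 5 exists in the default index
last_by_position = s2.iloc[-1]   # .iloc counts positions, like a list does

try:
    s2[-1]                       # there is no label -1 in the index
except KeyError:
    print("s2[-1] raises a KeyError")

print(last_by_label, last_by_position)
```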

----------
**create dataframes from scratch**

Usually in data analysis we somehow end up with a .csv file from our experiment, but first we will learn how to create dataframes from scratch. There are many different ways and this notebook is certainly not exhaustive:

- we can use a dictionary to combine our two fruit series s1 and s2 into a dataframe "shoppinglist" by using the ```pd.DataFrame(some_data)``` constructor. Here each (key:value) pair corresponds to a column: 

In [4]:
fruits = {"apples": s1, "bananas": s2} # first we need to arrange our series in a dictionary 

shoppinglist = pd.DataFrame(fruits) # pd.DataFrame(data) conveniently builds a nice looking dataframe for us 

shoppinglist # show our shoppinglist 

Unnamed: 0,apples,bananas
0,3,5
1,4,9
2,7,12
3,8,2
4,4,9
5,1,10


- another way to combine two series into a dataframe is ```pd.concat([seriesA, seriesB])```, which concatenates your series. Let's try to recreate the result displayed above:

In [5]:
pd.concat([s1,s2])

0     3
1     4
2     7
3     8
4     4
5     1
0     5
1     9
2    12
3     2
4     9
5    10
dtype: int64

Oops, something went wrong! Do you have an idea what happened? 



**KEEP IN MIND!** 
(pandas) functions have default settings, which might sometimes behave differently than expected. 

Remember? By checking ```pd.concat?``` in a code cell we see that the default option for concatenating two objects is along axis=0, i.e. along the rows! However, we want to recreate the nice looking dataframe above, which means we need to concatenate the objects along the column axis (i.e., axis=1) and specify this explicitly. Let's see whether this works:

In [6]:
shoppinglist = pd.concat([s1,s2], axis=1)
shoppinglist

Unnamed: 0,apples,bananas
0,3,5
1,4,9
2,7,12
3,8,2
4,4,9
5,1,10
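
To make the effect of the `axis` argument easier to see, this small sketch recreates the two series and compares the shapes of both results:

```python
import pandas as pd

s1 = pd.Series([3, 4, 7, 8, 4, 1], name="apples")
s2 = pd.Series([5, 9, 12, 2, 9, 10], name="bananas")

stacked = pd.concat([s1, s2])               # axis=0 (default): one long series
side_by_side = pd.concat([s1, s2], axis=1)  # axis=1: one column per series

print(stacked.shape)        # (12,)
print(side_by_side.shape)   # (6, 2)
```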


Right now, we are still using the pandas default for our index (i.e., numbers). Let's say we want to use customer names as an index:

```python
dataframe.set_index([list_of_anything_with_equal_length_to_dataframe])
```

Let's create a list of 6 customers and replace the current indices with this list to see how many fruits each of them is buying at the Wochenmarkt:

In [7]:
customer=["Victoria", "Rhonda", "Elli", "Rebecca", "Lucie", "Isa"]

In [8]:
shoppinglist = shoppinglist.set_index([customer])
shoppinglist

Unnamed: 0,apples,bananas
Victoria,3,5
Rhonda,4,9
Elli,7,12
Rebecca,8,2
Lucie,4,9
Isa,1,10


btw: if you want to check how many rows your dataframe has, just use ```len(dataframe)``` - pretty easy, huh?
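
As an alternative to calling `set_index` afterwards, the index can also be passed when the dataframe is created, via the `index` parameter of `pd.DataFrame`. The sketch below rebuilds the shopping list this way, using plain lists instead of `s1` and `s2` so that it runs on its own:

```python
import pandas as pd

customer = ["Victoria", "Rhonda", "Elli", "Rebecca", "Lucie", "Isa"]

shoppinglist = pd.DataFrame(
    {"apples": [3, 4, 7, 8, 4, 1], "bananas": [5, 9, 12, 2, 9, 10]},
    index=customer,  # one label per row
)

print(len(shoppinglist))    # number of rows
print(shoppinglist.shape)   # (rows, columns)
```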

**Adding columns and rows**

*Columns*

The Wochenmarkt is about to close and all our customers are thrilled by all the last-minute sale offers. All of them are about to buy some plums.

Again, many roads lead to Rome and we will just cover some of them:

- declare a ```pd.Series``` of equal length that is to become the new column and assign it with ```dataframe["new_column_name"] = new_series```


In [9]:
s3 = pd.Series([1,2,3,4,5,6], name="plums", index=customer) ## does not work if the indices do not correspond

shoppinglist["plums"]=s3

shoppinglist

Unnamed: 0,apples,bananas,plums
Victoria,3,5,1
Rhonda,4,9,2
Elli,7,12,3
Rebecca,8,2,4
Lucie,4,9,5
Isa,1,10,6


Since series also contain an index (if we don't define it, pandas will use its default!), the index needs to correspond to the index of our dataframe. Otherwise we will create a new column with undefined values (i.e. **N**ot **a** **N**umber, NaN values).
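
The index alignment described above can be seen in a small sketch. The dataframe `df` and the column names `plums_default` and `plums_named` are made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({"apples": [3, 4]}, index=["Victoria", "Rhonda"])

# Series with the default index 0, 1: no label matches "Victoria" or "Rhonda"
df["plums_default"] = pd.Series([1, 2])

# Series with a matching index: the values land in the right rows
df["plums_named"] = pd.Series([1, 2], index=["Victoria", "Rhonda"])

print(df)
```

The first new column contains only NaN, the second one contains the intended values.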

- this also works with lists and might be a little bit more convenient ``` dataframe["new_column_name"] = [some_list_with_equal_length]``` since lists do not contain an index

Try to add a new column "lemon" with random values for each customer!

In [10]:
shoppinglist["lemon"] = [2,4,5,1,7,9]
shoppinglist

Unnamed: 0,apples,bananas,plums,lemon
Victoria,3,5,1,2
Rhonda,4,9,2,4
Elli,7,12,3,5
Rebecca,8,2,4,1
Lucie,4,9,5,7
Isa,1,10,6,9


- if you want more flexibility, you could also use ```dataframe.insert``` to add a list of values to a new column at a specific position just like this: 

```python
dataframe.insert(position, "column_name", [some_list], True) ## True allows duplicate column names; omitting it raises
                                                             ## an error when the column name already exists
```

Try to add a new column "oranges" at the third position with any random integers for all our customers!

In [11]:
shoppinglist.insert(2, "oranges", [1,2,3,4,5,6], True)
shoppinglist

Unnamed: 0,apples,bananas,oranges,plums,lemon
Victoria,3,5,1,1,2
Rhonda,4,9,2,2,4
Elli,7,12,3,3,5
Rebecca,8,2,4,4,1
Lucie,4,9,5,5,7
Isa,1,10,6,6,9
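
What happens when the last argument (`allow_duplicates`) is left out can be tried in a short sketch. The small dataframe `df` is made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({"apples": [3, 4], "bananas": [5, 9]})

df.insert(1, "oranges", [1, 2])       # new column at position 1

try:
    df.insert(0, "apples", [0, 0])    # "apples" already exists
except ValueError as err:
    print("insert failed:", err)

print(df.columns.tolist())
```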


*Rows*

Oh hey there, we just met Norbert, who is currently doing a smoothie-detox treatment and do you know what? He also likes apples, bananas, oranges, plums and lemons a lot! Let's add him to our little dataframe!

Again, we can use ```pd.DataFrame``` to create a new, single-row dataframe for Norbert that contains values for each of our fruits. To combine our two dataframes, the column names in both dataframes need to be identical! 

```python

new_dataframe = pd.DataFrame([some_list_with_equal_length_to_old_df], columns=old_dataframe.columns.tolist())

# old_dataframe.columns.tolist() converts your column names into a list that you can easily pass to your new
# dataframe; list(old_dataframe) gives the same result

```

Try to create a new single-row dataframe called Norbert, that contains values for each fruit and uses the column name information of our shoppinglist dataframe!

In [12]:
norbert = pd.DataFrame([[5,5,4,6,3]], columns=shoppinglist.columns.tolist()) ## list(old_dataframe) works also! 
norbert

Unnamed: 0,apples,bananas,oranges,plums,lemon
0,5,5,4,6,3


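Let's add Norbert to our shoppinglist dataframe! You are already familiar with ```.append``` for adding new elements to a list. Older pandas versions offered a similar dataframe method:
```python

dataframe.append(new_dataframe)

```

```DataFrame.append``` was deprecated in pandas 1.4 and removed in pandas 2.0, so on current versions use ```pd.concat``` instead, which gives the same result.

Let's append Norbert to our dataframe and check our new dataframe!

In [40]:
shoppinglist = pd.concat([shoppinglist, norbert])  # on pandas < 2.0: shoppinglist.append(norbert)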
shoppinglist

Unnamed: 0,apples,bananas,oranges,plums,lemon
Victoria,3,5,1,1,2
Rhonda,4,9,2,2,4
Elli,7,12,3,3,5
Rebecca,8,2,4,4,1
Lucie,4,9,5,5,7
Isa,1,10,6,6,9
0,5,5,4,6,3


------- 

We already learned at the beginning of this session that we can use ```pd.concat([element1, element2])``` for combining two elements. We can use the same command to combine our two dataframes! Keep in mind that you might have to specify the axis along which you want to add the new dataframe/row. Since Norbert is already part of shoppinglist at this point, concatenating him again makes his row appear twice:

In [41]:
pd.concat([shoppinglist, norbert])

Unnamed: 0,apples,bananas,oranges,plums,lemon
Victoria,3,5,1,1,2
Rhonda,4,9,2,2,4
Elli,7,12,3,3,5
Rebecca,8,2,4,4,1
Lucie,4,9,5,5,7
Isa,1,10,6,6,9
0,5,5,4,6,3
0,5,5,4,6,3


------- 

**renaming**

What a pity! We forgot to update our index - Norbert's name is missing - let's better change that, before he gets any identity issues!

Do you have an idea how to solve this issue? You essentially already know all the commands to beat the riddler! 

- let's update our customer list
- let's set our index 
- let's check our dataframe

In [42]:
customer.append("Norbert")
shoppinglist = shoppinglist.set_index([customer])
shoppinglist

Unnamed: 0,apples,bananas,oranges,plums,lemon
Victoria,3,5,1,1,2
Rhonda,4,9,2,2,4
Elli,7,12,3,3,5
Rebecca,8,2,4,4,1
Lucie,4,9,5,5,7
Isa,1,10,6,6,9
Norbert,5,5,4,6,3
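
One way to avoid repairing the index afterwards is to name the new row when it is created, via the `index` parameter of `pd.DataFrame`. A minimal sketch with a shortened, made-up version of the shopping list:

```python
import pandas as pd

shoppinglist = pd.DataFrame(
    {"apples": [3, 4], "bananas": [5, 9]}, index=["Victoria", "Rhonda"]
)

# Name the new row on creation, so the index is already correct after combining
norbert = pd.DataFrame([[5, 5]], columns=shoppinglist.columns, index=["Norbert"])

shoppinglist = pd.concat([shoppinglist, norbert])
print(shoppinglist)
```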


Ok, tbh this is probably not the most straightforward way (some people would maybe also say it's not pythonic; btw, if you wanna know what pythonic means, check ```import this```). 

Let's see how we can rename columns or indices in different ways:

- Recap: we just used ```dataframe.columns.tolist()``` (or ```dataframe.index.tolist()```) to get a list of our columns/indices --> you already know how to change values in a list --> by using ```dataframe.columns = your_changed_list``` (or ```dataframe.index = your_changed_list```) you can assign new column names or indices



Try to rename our column "apples" to a specific kind of apple, e.g. GrannySmith, by indexing:

In [43]:
columns = shoppinglist.columns.tolist()
columns[0] = "GrannySmith"
shoppinglist.columns = columns
shoppinglist

Unnamed: 0,GrannySmith,bananas,oranges,plums,lemon
Victoria,3,5,1,1,2
Rhonda,4,9,2,2,4
Elli,7,12,3,3,5
Rebecca,8,2,4,4,1
Lucie,4,9,5,5,7
Isa,1,10,6,6,9
Norbert,5,5,4,6,3


- we can also use ```dataframe.rename(index={"old_value": "new_value"}, inplace=True)``` (or ```columns={...}``` for column names) to solve the issue in a single line of code. Setting ```inplace=True``` applies the modification directly to our dataframe. If we stick to the default (i.e. ```False```), ```rename``` returns a modified copy, so we would need to write ```dataframe = dataframe.rename(...)``` to "save" our modifications

Let's try to change one of your customer names:

In [44]:
shoppinglist.rename(index = {"Victoria":"Bianca"}, inplace=True)

In [45]:
shoppinglist

Unnamed: 0,GrannySmith,bananas,oranges,plums,lemon
Bianca,3,5,1,1,2
Rhonda,4,9,2,2,4
Elli,7,12,3,3,5
Rebecca,8,2,4,4,1
Lucie,4,9,5,5,7
Isa,1,10,6,6,9
Norbert,5,5,4,6,3
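
The difference between the default behavior and `inplace=True` can be checked with a short sketch. The dataframe `df` is a made-up miniature of the shopping list:

```python
import pandas as pd

df = pd.DataFrame({"apples": [3, 4]}, index=["Victoria", "Rhonda"])

# Without inplace=True, rename returns a new dataframe and leaves df untouched
renamed = df.rename(index={"Victoria": "Bianca"}, columns={"apples": "GrannySmith"})

print(df.index.tolist())          # original labels are still there
print(renamed.index.tolist())
print(renamed.columns.tolist())
```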


Besides adding and renaming stuff in our dataframe, we could also delete rows or columns by using ```drop``` :
```python

dataframe.drop(index=["element1","element2"])
dataframe.drop(columns=["element1","element2"])
```



Try to delete the first customer in your list:

In [46]:
shoppinglist = shoppinglist.drop(index=["Bianca"]) 
shoppinglist

Unnamed: 0,GrannySmith,bananas,oranges,plums,lemon
Rhonda,4,9,2,2,4
Elli,7,12,3,3,5
Rebecca,8,2,4,4,1
Lucie,4,9,5,5,7
Isa,1,10,6,6,9
Norbert,5,5,4,6,3
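
Dropping columns works the same way. Like `rename`, `drop` returns a new dataframe by default, which is why the result is assigned to a variable. A small sketch with a made-up dataframe `df`:

```python
import pandas as pd

df = pd.DataFrame(
    {"bananas": [9, 12], "plums": [2, 3], "lemon": [4, 5]},
    index=["Rhonda", "Elli"],
)

without_lemon = df.drop(columns=["lemon"])   # df itself keeps all three columns

print(without_lemon.columns.tolist())
print(df.columns.tolist())
```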


**indexing**

We already know from previous sessions that we can use indexing to access the first element of a list, the third letter of a string and so on ... in our dataframe universe we can do just the same

*indexing columns or rows*

- the easiest way to index a column is by using ```dataframe["column"]``` for one column and ```dataframe[["column1", "column2"]]``` for two columns.

Try to index your last two columns:

In [47]:
shoppinglist[["plums", "lemon"]]

Unnamed: 0,plums,lemon
Rhonda,2,4
Elli,3,5
Rebecca,4,1
Lucie,5,7
Isa,6,9
Norbert,6,3


When the index operator ```[]``` is passed a str or int, it attempts to find a column with this particular name and returns it as a series ... however, if we pass a **slice** to the operator, it changes its behavior and selects rows instead. We can do this with *int* as well as *str* slices!

Try to index all rows except the first and last one by using "int-slicing":

In [48]:
shoppinglist[1:5]

Unnamed: 0,GrannySmith,bananas,oranges,plums,lemon
Elli,7,12,3,3,5
Rebecca,8,2,4,4,1
Lucie,4,9,5,5,7
Isa,1,10,6,6,9


Try to only show what one customer bought at the Wochenmarkt using "str-slicing":


In [49]:
shoppinglist["Elli":"Elli"]

Unnamed: 0,GrannySmith,bananas,oranges,plums,lemon
Elli,7,12,3,3,5


As the simple index operator ```[]``` is not that flexible, we will have a look at two other approaches to index rows and columns:

- selecting rows and columns by **number** using ```dataframe.iloc[row_selection,column_selection]```

Try to only select the first two rows and all columns:



In [50]:
shoppinglist.iloc[0:2] # you could also use shoppinglist.iloc[0:2, :], where ":" means "all columns"

Unnamed: 0,GrannySmith,bananas,oranges,plums,lemon
Rhonda,4,9,2,2,4
Elli,7,12,3,3,5


Try to select rows 2-4 and columns 3-5!

In [55]:
shoppinglist.iloc[1:4,2:5]  

Unnamed: 0,oranges,plums,lemon
Elli,3,3,5
Rebecca,4,4,1
Lucie,5,5,7


- selecting rows and columns by **label/index** using ```dataframe.loc[row_selection,column_selection]```
- selecting rows with a **boolean** condition, also using ```dataframe.loc```

Try to select two rows by using the (customer) index:


In [56]:
shoppinglist.loc[["Isa","Elli"]] 

Unnamed: 0,GrannySmith,bananas,oranges,plums,lemon
Isa,1,10,6,6,9
Elli,7,12,3,3,5


Try to select three customers and two columns of your choice!

In [58]:
shoppinglist.loc[["Isa","Elli","Norbert"], ["bananas", "lemon"]] 

Unnamed: 0,bananas,lemon
Isa,10,9
Elli,12,5
Norbert,5,3
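
One difference between the two approaches is easy to overlook: slices in `.iloc` exclude the end position (like list slices), whereas label slices in `.loc` include the end label. A minimal sketch with a made-up dataframe `df`:

```python
import pandas as pd

df = pd.DataFrame(
    {"bananas": [9, 12, 2, 9]},
    index=["Rhonda", "Elli", "Rebecca", "Lucie"],
)

by_position = df.iloc[0:2]              # positions 0 and 1; the end is excluded
by_label = df.loc["Rhonda":"Rebecca"]   # the end label "Rebecca" is included

print(by_position.index.tolist())
print(by_label.index.tolist())
```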


Let's imagine that you are particularly interested in customers that bought fewer than 8 bananas or exactly 2 lemons. Such questions and row selections can easily be answered by using conditional selections with booleans in ```dataframe.loc[selection]```. Remember what booleans are about? 

If we want to select only those customers who bought fewer than 8 bananas:

In [109]:
shoppinglist.loc[shoppinglist["bananas"] < 8]

Unnamed: 0,GrannySmith,bananas,oranges,plums,lemon
Rebecca,8,2,4,4,1
Norbert,5,5,4,6,3


Let's see how this works: a comparison such as ```dataframe["column"] < some_value``` returns a **pandas Series** with True or False for each of our rows: 

In [60]:
shoppinglist["bananas"] < 8

Rhonda     False
Elli       False
Rebecca     True
Lucie      False
Isa        False
Norbert     True
Name: bananas, dtype: bool

You can also combine two or more conditional statements:

In [103]:
shoppinglist.loc[(shoppinglist["bananas"] < 8) & (shoppinglist["lemon"] == 2)]

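Unnamed: 0,GrannySmith,bananas,oranges,plums,lemon

In the current shoppinglist no customer bought fewer than 8 bananas *and* exactly 2 lemons, so the result is an empty dataframe.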
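
Besides `&` ("and"), conditions can also be combined with `|` ("or") and inverted with `~` ("not"). The sketch below rebuilds two columns of the shopping list so that it runs on its own; the threshold of 6 lemons is an arbitrary choice for illustration. Storing each condition in a variable keeps the expression readable; when conditions are written inline, each one needs its own parentheses.

```python
import pandas as pd

df = pd.DataFrame(
    {"bananas": [9, 12, 2, 9, 10, 5], "lemon": [4, 5, 1, 7, 9, 3]},
    index=["Rhonda", "Elli", "Rebecca", "Lucie", "Isa", "Norbert"],
)

few_bananas = df["bananas"] < 8    # boolean series, one value per row
many_lemons = df["lemon"] > 6

either = df.loc[few_bananas | many_lemons]      # at least one condition is True
neither = df.loc[~(few_bananas | many_lemons)]  # both conditions are False

print(either.index.tolist())
print(neither.index.tolist())
```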
