# Data processing

## Contents <a id=ov>
1. [Saving and Loading Data](#save)
2. [Numpy](#np)
3. [Saving and loading data into Pandas dataframe](#savepandas)
4. [Investigate your Pandas dataframe](#Investigate)
5. [Pandas Series Object](#series)
6. [Indexing and filtering](#filter)
7. [Plotting](#plt)
8. [Statsmodels](#sm)




## Saving and Loading Data <a id=save>
### Pickle
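You can save and load almost every Python object via pickle. Only load pickle files from sources you trust, because loading a pickle file can execute arbitrary code: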

In [None]:
import pickle

football_players={'Neymar':['PSG',29,100],
                  'Haaland':['BVB',21,150],
                  'Lukaku':['Chelsea',28,100],
                  'Messi':['PSG',34,80],
                  'Goretzka':['Bayern',26,70],
                  'Salah':['Liverpool',29,100],
                  'Kane':['Tottenham',28,120],
                  'Folden':['ManU',21,80]}
# Save the dictionary; "with" closes the file afterwards
with open('football_players.p','wb') as file:
    pickle.dump(football_players,file)
print('football_players.p','saved!')

In [None]:
# Load the dictionary again
with open('football_players.p','rb') as file:
    football_players=pickle.load(file)
print(football_players)

### Text files

Load text files with ``open()``:

In [None]:
# Open the file once, with an explicit encoding; "with" closes it afterwards
with open('tu_dortmund.txt','r',encoding='utf-8') as reader:
    # Remove the line break at the end of every row
    list_of_rows=[row.replace('\n','') for row in reader]
#print(list_of_rows)

# Print the first ten rows
for i in range(0,10):
    print(list_of_rows[i])

Write text files with ``write()`` or ``print()``:

In [None]:
with open('output_file.txt','w') as out_file:
    for i in range(0,10):
        out_file.write(str(list_of_rows[i])+'\n')
    print('output_file.txt','saved!')

In [None]:
with open('output_file2.txt','w') as out_file:
    for i in range(0,10):
        print(i,list_of_rows[i],'end_of_line',sep='|',end='\n',file=out_file)
    print('output_file2.txt','saved!')

In [None]:
with open('output_file2.txt','a') as out_file:
    for i in range(15,100):
        print(i,list_of_rows[i],'end_of_line',sep='|',end='\n',file=out_file)
    print('output_file2.txt','appended!')
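
The difference between the modes ``'w'`` (overwrite) and ``'a'`` (append) used above can be checked with a small sketch. The file name `mode_demo.txt` and the temporary directory are assumptions made for illustration, so that no existing file is touched.

```python
import os
import tempfile

# Hypothetical file inside a temporary directory
demo_path = os.path.join(tempfile.mkdtemp(), 'mode_demo.txt')

# 'w' creates the file or overwrites an existing one
with open(demo_path, 'w', encoding='utf-8') as out_file:
    out_file.write('first\n')

# Opening with 'w' again discards the old content
with open(demo_path, 'w', encoding='utf-8') as out_file:
    out_file.write('second\n')

# 'a' keeps the old content and adds new rows at the end
with open(demo_path, 'a', encoding='utf-8') as out_file:
    print('third', file=out_file)

# Read the file back to see what survived
with open(demo_path, encoding='utf-8') as reader:
    demo_rows = [row.rstrip('\n') for row in reader]

print(demo_rows)
```

Only the rows written in the last ``'w'`` block and the ``'a'`` block remain in the file.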


### JSON files
A convenient way to save datasets that consist of dictionaries, lists, strings and numbers is ***JSON***, because it is readable by humans and by other programming languages:


In [None]:
import json
#save
with open('football_players.json','w') as file:
    json.dump(football_players,file)


#load
with open('football_players.json') as json_object:
    python_object=json.load(json_object)

print(python_object)
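
JSON only knows a few data types, so a round trip can change a Python object slightly. The following sketch uses made-up example data (the dictionary `original` is an assumption, not part of the dataset) to show two typical cases: tuples come back as lists and integer keys come back as strings.

```python
import json

# Hypothetical example data with a tuple and an integer key
original = {'squad': ('Neymar', 'Messi'), 7: 'shirt number'}

# dumps/loads work like dump/load, but on strings instead of files
restored = json.loads(json.dumps(original))

print(restored)
print(restored == original)
```

If the exact Python types matter, pickle is the safer choice. If the file should be readable outside of Python, JSON is usually preferable.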

### XML files
XML files can be parsed with Beautiful Soup. For large files, however, treating the file as text using ``open()`` and searching for keys with string methods can be more efficient.
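
A minimal sketch of the string-method approach could look like this. The XML content and the tag name `name` are invented for illustration, `io.StringIO` only stands in for a real file opened with ``open()``, and the code assumes that every opening tag and its closing tag are on the same row.

```python
import io

# Hypothetical XML content; io.StringIO stands in for open('some_file.xml')
xml_file = io.StringIO(
    '<players>\n'
    '  <player><name>Neymar</name><club>PSG</club></player>\n'
    '  <player><name>Haaland</name><club>BVB</club></player>\n'
    '</players>\n'
)

start_tag, end_tag = '<name>', '</name>'
names = []
for row in xml_file:              # reads one row at a time, like a file
    start = row.find(start_tag)
    if start == -1:               # tag not in this row
        continue
    end = row.find(end_tag, start)
    names.append(row[start + len(start_tag):end])

print(names)
```

Because the file is read row by row, it never has to fit into memory as a whole. For nested or irregular XML, a real parser is the more robust option.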

## Numpy <a id=np>
[Back to Content Overview](#ov)

Numpy is the Python library for vector/matrix operations. Most of the important packages in data science and machine learning build on it.

In [None]:
import numpy as np

### Basics
You can define a vector/matrix in the following way:

In [None]:
test_vector=np.array([[1,4,5],[7,3,6]])
print(type(test_vector))
print(test_vector)

In [None]:
# Get the dimensions of the array:
print(np.shape(test_vector))

In [None]:
# Transpose the array:
print(test_vector.transpose())
print(test_vector.T)

In [None]:
# Change the shape of the array
print(np.array(range(0,10)).reshape((2,5)))

#### Special Arrays:

In [None]:
# Array with zeros:
print(np.zeros((2,3)))

In [None]:
# Array with ones:
print(np.ones((2,3)))

In [None]:
# Array with range of numbers:
print(np.arange(10))
print(np.array(range(10))) # Similar

In [None]:
# Array with evenly spaced numbers in an interval
print(np.linspace(0, 1, 5))

#### Random:

In [None]:
# Array with uniformly distributed random numbers in the interval [0, 1):
print(np.random.random(size=5))

In [None]:
# Array with random integers from 0 to 9 (the upper bound 10 is excluded)
print(np.random.randint(0,10,size=200))

In [None]:
# Array with normal dist. random numbers:
print(np.random.normal(loc=0.0, scale=1.0, size=6))
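
The cells above return different numbers on every run. If the results should be reproducible, one option is a seeded random number generator. The sketch below uses `np.random.default_rng`; the seed 42 is an arbitrary choice.

```python
import numpy as np

# Two generators with the same (arbitrary) seed
rng_a = np.random.default_rng(42)
rng_b = np.random.default_rng(42)

sample_a = rng_a.integers(0, 10, size=5)  # integers from 0 to 9
sample_b = rng_b.integers(0, 10, size=5)

print(sample_a)
print(np.array_equal(sample_a, sample_b))  # same seed, same numbers

# The generator also offers uniform and normal random numbers
print(rng_a.random(size=3))
print(rng_a.normal(loc=0.0, scale=1.0, size=3))
```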



#### Indexing:

In [None]:
test_vector=np.array([[1,4,5,6],[7,3,6,3],[3,3,1,9]])
print(test_vector)

In [None]:
# Print first row
print(test_vector[0,:])

In [None]:
# Print last column
print(test_vector[:,-1])

<span style="color:blue"><b>Task:</b></span> Print the 2x2 matrix in the lower, right corner.

<span style="color:blue"><b>Task:</b></span> Create a  4x4 identity matrix!

In [None]:
# Use two for loops!


In [None]:
# Use the diagonal method!


In [None]:
# Use the method that Numpy provides for you!


You can also index arrays with booleans:

In [None]:
print(test_vector)
print(test_vector[test_vector>3])
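
Boolean indexing works in two steps: the comparison creates a boolean array (a mask), and the mask selects the elements. The following sketch, with the variable names `matrix`, `mask`, `between` and `capped` chosen only for illustration, also shows how to combine two conditions and how to overwrite the selected elements.

```python
import numpy as np

matrix = np.array([[1,4,5,6],[7,3,6,3],[3,3,1,9]])

# The comparison returns a boolean array of the same shape
mask = matrix > 3
print(mask)

# Combine conditions with & (and) and | (or); the parentheses are required
between = matrix[(matrix > 3) & (matrix < 7)]
print(between)

# A mask can also be used to overwrite the selected elements
capped = matrix.copy()
capped[capped > 5] = 5
print(capped)
```

Note that the selection with a mask always returns a flat vector, while the assignment keeps the shape of the matrix.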

#### Stacking
You can also stack vectors or matrices horizontally or vertically:

In [None]:
test_vector=np.array([[1,4,5],[7,3,6]])
print(test_vector)

In [None]:
#Stack horizontally:
print(np.hstack((test_vector,test_vector)))

In [None]:
#Stack vertically:
print(np.vstack((test_vector,test_vector)))
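
Stacking only works if the shapes fit together. The sketch below uses small made-up arrays (`left` and `right` are illustrative names) to show which dimension has to match in each case.

```python
import numpy as np

left = np.array([[1,4,5],[7,3,6]])      # shape (2, 3)
right = np.array([[0],[0]])             # shape (2, 1)

# Horizontal stacking needs the same number of rows
wide = np.hstack((left, right))
print(wide.shape)

# Vertical stacking needs the same number of columns
tall = np.vstack((left, np.ones((1, 3))))
print(tall.shape)

# np.concatenate does the same with an explicit axis
same = np.concatenate((left, right), axis=1)
print(np.array_equal(wide, same))
```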

#### Sorting:

In [None]:
sort_vector=np.random.randint(0,10,size=20)
print(sort_vector)

In [None]:
# Sort vector
print(np.sort(sort_vector))

In [None]:
# Returns the indices that would sort the vector:
print(np.argsort(sort_vector))

In [None]:
print(sort_vector[np.argsort(sort_vector)])

In [None]:
# You can use it to sort a vector by the order of another vector:
sort_vector_2=np.random.randint(0,10,size=20)
print(sort_vector_2)
print(sort_vector_2[np.argsort(sort_vector)])
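
With random numbers it is hard to see what happens here, so the following sketch repeats the idea with a small fixed example. The arrays `ages` and `names` are made up for illustration.

```python
import numpy as np

# Hypothetical example data
ages = np.array([29, 21, 34, 26])
names = np.array(['Neymar', 'Haaland', 'Messi', 'Goretzka'])

order = np.argsort(ages)      # indices that would sort ages
print(order)
print(ages[order])            # the sorted ages
print(names[order])           # names ordered from youngest to oldest
```

The first entry of `order` is the position of the smallest age, the second entry the position of the second smallest age, and so on.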

In [None]:
#There are also functions to find the minimum or maximum value of a vector and its index:
print(np.max(sort_vector))
print(np.min(sort_vector))
print(np.argmax(sort_vector))
print(np.argmin(sort_vector))

#### Mathematical functions:

Numpy provides all standard (and not so standard) [mathematical functions](https://numpy.org/doc/stable/reference/routines.math.html):

In [None]:
test_vector=np.array([[1,4,5,6],[7,3,6,3],[3,3,1,9]])
print(test_vector)

In [None]:
#Total sum
print(np.sum(test_vector))

In [None]:
#Sum for every column
print(np.sum(test_vector,axis=0))

In [None]:
#Sum for every row
print(np.sum(test_vector,axis=1))
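
A helpful way to remember the `axis` argument: the axis you pass is the one that disappears from the result. The sketch below illustrates this with the shapes of the results and with the `keepdims` option; the variable names are chosen only for this example.

```python
import numpy as np

matrix = np.array([[1,4,5,6],[7,3,6,3],[3,3,1,9]])   # shape (3, 4)

column_sums = np.sum(matrix, axis=0)   # axis 0 (rows) disappears
row_sums = np.sum(matrix, axis=1)      # axis 1 (columns) disappears
print(column_sums.shape, row_sums.shape)

# keepdims=True keeps the reduced axis with length 1,
# so the result can be combined with the original matrix
row_sums_2d = np.sum(matrix, axis=1, keepdims=True)
print(row_sums_2d.shape)
print(matrix / row_sums_2d)            # share of each element in its row
```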

<span style="color:blue"><b>Task:</b></span> Calculate the mean for the first two rows separately. (2 Versions)

In [None]:
# You can also apply the standard math operators to numpy arrays:
print(test_vector*2)
print(test_vector%3)

In [None]:
# The standard deviation is the square root of the variance:
print(np.var(test_vector,axis=0)**0.5)
print(np.std(test_vector,axis=0))

In [None]:
print(np.sin(test_vector))
print(np.cos(test_vector))
print(np.tan(test_vector))

<span style="color:blue"><b>Task:</b></span> Rebuild this numpy.sin function with a for loop.

<span style="color:blue"><b>Task:</b></span> Calculate the sum of squares for every column.

<span style="color:blue"><b>Task:</b></span> Standardise the test_vector.

#### Other functions

In [None]:
test_vector=np.array([[1,4,5,6],[7,3,6,3],[3,3,1,9]])
test_vector_2=np.array([[4,2,9,1],[7,4,6,3],[2,4,1,9]])

In [None]:
# Use where to apply a condition to vectors.
print(np.where(test_vector>=test_vector_2,test_vector,np.nan))
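
`np.where` can be called in two ways, which the following sketch compares. The short names `a` and `b` are used only for this example and hold the same values as the two test vectors above.

```python
import numpy as np

a = np.array([[1,4,5,6],[7,3,6,3],[3,3,1,9]])
b = np.array([[4,2,9,1],[7,4,6,3],[2,4,1,9]])

# Three arguments: take the value from a where the condition holds, else from b
larger = np.where(a >= b, a, b)
print(larger)

# Mixing integers with np.nan turns the result into floats
with_nan = np.where(a >= b, a, np.nan)
print(with_nan.dtype)

# One argument: return the row and column indices where the condition holds
rows, cols = np.where(a > b)
print(rows, cols)
```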

In [None]:
# Use unique to delete duplicates and return the counts.
print(np.unique(test_vector,return_counts=True))

In [None]:
unique_vector,counts=np.unique(test_vector,return_counts=True)
print(unique_vector)
print(counts)

<span style="color:blue"><b>Task:</b></span> Sort the unique values with their counts in descending order.

## Saving and loading data into Pandas dataframe <a id=savepandas>
[Back to Content Overview](#ov)

In [None]:
import pandas as pd

The main object of the pandas module is the dataframe, a data table with named rows and columns:

In [None]:
#Import data from a matrix
column_names=['column_'+str(i) for i in range(0,4)]
print(column_names)
df=pd.DataFrame(data=test_vector,columns=column_names,index=None)

In [None]:
print(df)
print(df.index,list(df.index))
print(df.columns)

In [None]:
#You can change the indices and columns via:
df.index=[1,2,3]
df.columns=['one','two','three','four']

print(df)

In [None]:
#You can access columns like dict values:
print(df['one'])

print(list(df['one']))

In [None]:
for col in df.columns:
    print(df[col])

### Importing and saving data

In [None]:
#Import data from a nested_list
df=pd.DataFrame(data=[[1,4,5,6],[7,3,6,3],[3,3,1,9]],columns=None,index=None)
print(df)


In [None]:
#Import data from a dict
data_dict={'col1':[1,4,5,6],'col2':[7,3,6,3],'col3':[3,3,1,9]}
df=pd.DataFrame(data=data_dict)
print(df)

In [None]:
#Import data from a nested dict (outer keys: columns, inner keys: row names)
data_dict={'col1':{'ind1':23,'ind2':1,'ind3':'string'},'col2':{'ind1':89,'ind3':'string'},'col3':{'ind3':'hello'}}
data_dict['col1']['ind1']=688
df=pd.DataFrame(data=data_dict)
print(df)
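
In a nested dict not every column needs a value for every row. The sketch below uses made-up numbers (the dict `nested` is purely illustrative) to show that pandas fills the missing combinations with NaN.

```python
import pandas as pd

# Hypothetical nested dict: outer keys are columns, inner keys are row names
nested = {'goals': {'Neymar': 13, 'Messi': 11},
          'assists': {'Messi': 14, 'Kane': 9}}

df_nested = pd.DataFrame(data=nested)
print(df_nested)

# Missing combinations are filled with NaN; count them per column
print(df_nested.isna().sum())
```

Because NaN is a float, a column with missing values is stored as float even if all given values are integers.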

In [None]:
#Save as csv
df.to_csv('test_df.csv')

In [None]:
import sys
!{sys.executable} -m pip install openpyxl
#Save as excel
df.to_excel('test_df.xlsx')

## Investigate your Pandas dataframe <a id=Investigate>
[Back to Content Overview](#ov)

In [None]:
#Import data from an Excel file
df=pd.read_excel('top_500_football_players.xlsx',sheet_name='final')
print(df)

In [None]:
#Show all column names
print(df.columns)

In [None]:
#Show all row names
print(df.index)

In [None]:
#Show the ten most valuable players
print(df.head(10))

In [None]:
#Show Mr. Irrelevant
print(df.tail(1))

In [None]:
#Show the data types of columns
print(df.dtypes)

In [None]:
#Count the data types of columns
print(df.dtypes.value_counts())

In [None]:
#Count the age of the players
print(df['Age'].value_counts())

In [None]:
#Show all players with age 36
print(df[df['Age']==36]['Player'])

In [None]:
#Show the most important information of the dataframe
df.info()

In [None]:
#Show the most important statistics of the dataframe
df.describe()

<span style="color:blue"><b>Task:</b></span> Replicate df.describe()! Hint: Use numpy to calculate the statistical measures.

In [None]:
df_describe=pd.DataFrame()

## Pandas Series Object <a id=series>
[Back to Content Overview](#ov)

Every column of a dataframe is a Pandas Series Object.

In [None]:
print(df['Age'])

In [None]:
print(type(df['Age']))

You can also define them manually:

In [None]:
test_series = pd.Series(range(3,16))
print(test_series)

In [None]:
# You can also give them a name:
test_series.name = 'Magic Series'
print(test_series)

Series have some specific functions ...

In [None]:
print(test_series.sum())
print(test_series.mean())
print(test_series.value_counts())

... but you can also treat them as numpy vectors and apply every numpy function to them:

In [None]:
print(np.sum(test_series))
print(np.sum(test_series[test_series**2>20]))

In [None]:
# Add a new column to the dataframe
df['Scorer Points']=df['Goals']+df['Assists']
print(df.head(10))

<span style="color:blue"><b>Task:</b></span> Add a new column with goals per game to the df!

You can also apply list comprehensions on series objects:

In [None]:
df['Last Name']= [name.split(' ')[-1] for name in df['Player']]
print(df['Last Name'])

<span style="color:blue"><b>Task:</b></span> Convert the market value to type float:

In [None]:
print(df['Market value'])

## Indexing and filtering <a id=filter>
[Back to Content Overview](#ov)

In [None]:
df.index=df['Player']
print(df)

<span style="color:blue"><b>Task:</b></span> Restore the old index.

In [None]:
# Query columns with brackets and column name
print(df['Last Name'])
# Query an element with column name and position
# (the index now consists of player names, so iloc is used for the position)
print(df['Last Name'].iloc[3])

### loc
Use loc to get elements with their row and column names:

In [None]:
# Get values for Salah
df.loc['Mohamed Salah',:]

<span style="color:blue"><b>Task:</b></span> Get Salah's age:

### iloc
Use iloc to get elements with their row and column indices:

In [None]:
# Get values for most valuable player
df.iloc[0,:]
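
The difference between `loc` (names) and `iloc` (positions) is easiest to see on a small dataframe. The sketch below uses a made-up mini dataframe `df_demo`, which is an assumption for illustration and not part of the dataset.

```python
import pandas as pd

# Hypothetical mini dataframe with player names as index
df_demo = pd.DataFrame({'Club': ['PSG', 'BVB', 'Liverpool'],
                        'Age': [29, 21, 29]},
                       index=['Neymar', 'Haaland', 'Salah'])

# loc uses labels; a label slice includes its end point
print(df_demo.loc['Haaland', 'Club'])
print(df_demo.loc['Neymar':'Haaland', 'Age'])

# iloc uses integer positions; a position slice excludes its end point
print(df_demo.iloc[1, 0])
print(df_demo.iloc[0:1, 1])
```

The label slice returns two rows, while the position slice `0:1` returns only one row, as usual for Python slices.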

<span style="color:blue"><b>Task:</b></span> Get the last column for the ten most valuable players:

### filter

You can filter the dataframe with conditions similar to the numpy vectors:

In [None]:
# Only keep players who played at least one game
df_filter=df[df['Matches']>0]
print(df_filter)

In [None]:
# Only keep players who are under 25 and have a market value over 50 million:
df['Market value'] = [float(value.replace('€','').replace('m','')) for value in df['Market value']]
df_filter=df[(df['Age']<25) & (df['Market value']>50)] # use | for or
print(df_filter)
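
The logical operators for such conditions are `&` (and), `|` (or) and `~` (not). The following sketch shows them on a made-up mini dataframe `df_demo`; the values are illustrative only.

```python
import pandas as pd

# Hypothetical mini dataframe
df_demo = pd.DataFrame({'Player': ['Neymar', 'Haaland', 'Messi', 'Goretzka'],
                        'Age': [29, 21, 34, 26],
                        'Market value': [100.0, 150.0, 80.0, 70.0]})

# A condition on a column returns a boolean Series
young = df_demo['Age'] < 25
print(young)

# | keeps rows where at least one condition holds
young_or_expensive = df_demo[young | (df_demo['Market value'] >= 100)]
print(young_or_expensive['Player'].tolist())

# ~ inverts a condition
not_young = df_demo[~young]
print(not_young['Player'].tolist())
```

The parentheses around each comparison are required, because `&` and `|` bind more strongly than `<` or `>=` in Python.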

You can also use list comprehensions with booleans to tackle more complicated conditions:

<span style="color:blue"><b>Task:</b></span> Only keep players with a middle name. (Name consists of more than 2 words.)

In [None]:
df_filter=

Use ``df.drop_duplicates()`` to delete duplicates from your dataset:

In [None]:
print(df.drop_duplicates())
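
By default, `drop_duplicates()` only removes rows that are identical in every column. The sketch below, on a made-up mini dataframe `df_demo`, shows the parameters `subset` and `keep`, which control what counts as a duplicate and which row is kept.

```python
import pandas as pd

# Hypothetical mini dataframe with one fully duplicated row
df_demo = pd.DataFrame({'Player': ['Neymar', 'Messi', 'Messi', 'Salah'],
                        'Club': ['PSG', 'PSG', 'PSG', 'Liverpool']})

# Without arguments only fully identical rows are removed
no_copies = df_demo.drop_duplicates()
print(no_copies)

# subset restricts the comparison to some columns,
# keep decides which of the duplicated rows survives
first_per_club = df_demo.drop_duplicates(subset=['Club'], keep='first')
last_per_club = df_demo.drop_duplicates(subset=['Club'], keep='last')
print(first_per_club)
print(last_per_club)
```

The method returns a new dataframe; `df_demo` itself still contains all four rows.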

<span style="color:blue"><b>Task:</b></span> Only show the player with the highest market value for every age.

<span style="color:blue"><b>Task:</b></span> Only show the player with the lowest market value for every age.

Use ``df.sample()`` to get a random sample from your dataset:

In [None]:
# Choose 5 players randomly from the dataset
print(df.sample(5).iloc[:,:6])


<span style="color:blue"><b>Task:</b></span> Choose 10% of the df.

### Group by



In [None]:
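# Mean of all numeric columns for every position
# (numeric_only=True skips text columns, which newer pandas versions no longer drop silently)
print(df.groupby(['Position']).mean(numeric_only=True))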
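
`groupby` splits the dataframe into groups, applies a function to every group and combines the results. The following sketch shows this on a made-up mini dataframe `df_demo` (values are illustrative only), once for a single column and once with several statistics via `agg`.

```python
import pandas as pd

# Hypothetical mini dataframe
df_demo = pd.DataFrame({'Position': ['Forward', 'Forward', 'Midfield', 'Midfield'],
                        'Age': [29, 21, 26, 30],
                        'Goals': [10, 20, 4, 6]})

# Mean of one column per group; the group names become the index
mean_goals = df_demo.groupby('Position')['Goals'].mean()
print(mean_goals)

# Several statistics at once with agg
summary = df_demo.groupby('Position').agg({'Age': 'mean', 'Goals': ['sum', 'max']})
print(summary)
```

The result of `agg` with several functions has two column levels, so a single value is addressed with a tuple such as `('Goals', 'sum')`.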

<span style="color:blue"><b>Task:</b></span> Show the mean market value for each position for each age.

<span style="color:blue"><b>Task:</b></span> Add a new column to the df with the mean market value of the respective position. (Use a list comprehension.)

In [None]:
df_helper=


<span style="color:blue"><b>Task:</b></span> Apply Position Fixed Effects

In [None]:
df['MV_FE']=