# Data Formats, Open Data, Tidy Data

In this exercise, you will be working with **Lord of the Rings** data. The dataset can be found on [Kaggle](https://www.kaggle.com/paultimothymooney/lord-of-the-rings-data). 

1. Download the following CSV file: [`lotr_scripts.csv`](https://www.kaggle.com/datasets/paultimothymooney/lord-of-the-rings-data?select=lotr_scripts.csv).
2. Document and describe the different data fields.
3. Identify "dirty" data fields and clean them up. Use regex replace, spreadsheets, OpenRefine or whatever you like. 
4. Document your working steps in a Markdown-formatted file. Export your dataset as a clean CSV file. Add both files to this repository (in this directory). 
5. Analyze the dataset using shell scripts and/or regex. Document the commands in an additional section in your Markdown-formatted file.
    * Find the total number of lines and the number of unique words used in the dialogs.
    * How are the dialog lines distributed across the three films?
    * What are the top 5 characters in the `char` column?
    * What are the top 5 characters in the dialogs?

In [1]:
import pandas as pd

df = pd.read_csv('lotr_scripts.csv')

print(df.head())

print("Columns in the dataset:", df.columns)


   Unnamed: 0     char                                             dialog  \
0           0   DEAGOL  Oh Smeagol Ive got one! , Ive got a fish Smeag...   
1           1  SMEAGOL     Pull it in! Go on, go on, go on, pull it in!     
2           2   DEAGOL                                           Arrghh!    
3           3  SMEAGOL                                          Deagol!     
4           4  SMEAGOL                                          Deagol!     

                     movie  
0  The Return of the King   
1  The Return of the King   
2  The Return of the King   
3  The Return of the King   
4  The Return of the King   
Columns in the dataset: Index(['Unnamed: 0', 'char', 'dialog', 'movie'], dtype='object')
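
For task 2 (describing the fields) and task 3 (spotting dirty fields), a per-column summary can be a useful starting point. The following is a minimal sketch, not part of the original notebook: `describe_fields` and `untrimmed_counts` are hypothetical helper names, and the sample rows are modelled on rows 1 to 4 of the `head()` output above. The trailing spaces in the sample are an assumption based on how that output is padded.

```python
import pandas as pd


def describe_fields(df):
    """Return one summary row per column: dtype, missing values, distinct values."""
    return pd.DataFrame({
        'dtype': df.dtypes.astype(str),
        'missing': df.isna().sum(),
        'unique': df.nunique(),
    })


def untrimmed_counts(df, columns):
    """Count, per column, the values with leading or trailing whitespace."""
    counts = {}
    for name in columns:
        values = df[name].dropna().astype(str)
        counts[name] = int((values != values.str.strip()).sum())
    return counts


# Sample rows modelled on the head() output (trailing spaces are assumed)
sample = pd.DataFrame({
    'char': ['SMEAGOL', 'DEAGOL', 'SMEAGOL', 'SMEAGOL'],
    'dialog': [
        'Pull it in! Go on, go on, go on, pull it in!  ',
        'Arrghh! ',
        'Deagol!  ',
        'Deagol!  ',
    ],
    'movie': ['The Return of the King '] * 4,
})

summary = describe_fields(sample)
untrimmed = untrimmed_counts(sample, ['char', 'dialog', 'movie'])
print(summary)
print(untrimmed)
```

On the real file, the same two calls would run on the `df` returned by `pd.read_csv('lotr_scripts.csv')`. Columns with a non-zero untrimmed count are candidates for `str.strip()`.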


In [17]:
import re

import pandas as pd

# Read the raw dataset
df = pd.read_csv('lotr_scripts.csv')

# 1. Remove parenthesised stage directions from the 'dialog' column
df['dialog'] = df['dialog'].str.replace(r'\(.*?\)', '', regex=True)

# 2. Remove any remaining stray parentheses from all text columns
for col in ['char', 'dialog', 'movie']:
    df[col] = df[col].str.replace(r'[()]', '', regex=True)

# 3. Remove "voice over" / "voiceover" from the character names
df['char'] = df['char'].str.replace(r'voice\s*over', '', flags=re.IGNORECASE, regex=True)

# 4. Collapse runs of whitespace in 'dialog' into a single space and remove
#    spaces directly before a comma. This runs after the removals above,
#    which can leave double spaces behind.
df['dialog'] = df['dialog'].str.replace(r'\s+', ' ', regex=True)
df['dialog'] = df['dialog'].str.replace(r'\s+,', ',', regex=True)

# 5. Strip leading and trailing whitespace from all text columns
for col in ['char', 'dialog', 'movie']:
    df[col] = df[col].str.strip()

# 6. Standardise the capitalisation of the character names
df['char'] = df['char'].str.capitalize()  # first letter upper case, rest lower case

# 7. Drop rows whose dialog is missing (NaN) or empty
df = df[df['dialog'].notna() & (df['dialog'] != '')]

# Save the cleaned CSV file
df.to_csv('lotr_scripts_clean.csv', index=False)

# Print the first rows as a check
print(df.head())

   Unnamed: 0     char                                             dialog  \
0           0   Deagol  Oh Smeagol Ive got one!, Ive got a fish Smeago...   
1           1  Smeagol       Pull it in! Go on, go on, go on, pull it in!   
2           2   Deagol                                            Arrghh!   
3           3  Smeagol                                            Deagol!   
4           4  Smeagol                                            Deagol!   

                    movie  
0  The Return of the King  
1  The Return of the King  
2  The Return of the King  
3  The Return of the King  
4  The Return of the King  
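
The questions in task 5 can also be sketched in pandas on the cleaned data. This is a minimal sketch under stated assumptions, not the documented shell/regex solution: `tokenize` and `analyze` are hypothetical helper names, a "word" is taken to be a lower-cased run of letters and apostrophes, and "top characters in the dialogs" is read as the character names (from the `char` column) mentioned most often in the dialog text. Only single-word names are matched. The sample rows are rows 1 to 4 of the cleaned output above.

```python
import re
from collections import Counter

import pandas as pd


def tokenize(text):
    """Lower-case a dialog line and split it into words (letters and apostrophes)."""
    return re.findall(r"[a-z']+", text.lower())


def analyze(df, top_n=5):
    """Compute line count, unique words, film distribution and top characters."""
    dialogs = df['dialog'].dropna().astype(str)
    words = [word for line in dialogs for word in tokenize(line)]
    speakers = df['char'].dropna()
    # Speaker names, lower-cased, used to find mentions inside the dialog text
    names = {name.lower() for name in speakers.unique()}
    mentions = Counter(word for word in words if word in names)
    return {
        'lines': len(dialogs),
        'unique_words': len(set(words)),
        'lines_per_movie': df['movie'].value_counts().to_dict(),
        'top_speakers': speakers.value_counts().head(top_n).to_dict(),
        'top_mentions': mentions.most_common(top_n),
    }


# Rows 1 to 4 of the cleaned head() output
sample = pd.DataFrame({
    'char': ['Smeagol', 'Deagol', 'Smeagol', 'Smeagol'],
    'dialog': [
        'Pull it in! Go on, go on, go on, pull it in!',
        'Arrghh!',
        'Deagol!',
        'Deagol!',
    ],
    'movie': ['The Return of the King'] * 4,
})

result = analyze(sample)
for key, value in result.items():
    print(key, value)
```

For the full analysis, `analyze(pd.read_csv('lotr_scripts_clean.csv'))` would be the equivalent call. The word count depends on the tokenizer, so a different definition of "word" gives a different number of unique words.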
