# PJE B - Data Analysis

## Data cleaning

Before analysis, each tweet is cleaned according to the rules below. A result of `None` means the tweet is empty after cleaning and is discarded.

### Rules

- Remove mentions:
    - old: `@user how are you ?`
    - new: `how are you ?`

- Remove hashtags:
    - old: `This is so cool #music`
    - new: `This is so cool`

- Remove retweet markers (`RT`):
    - old: `RT I love this song`
    - new: `I love this song`

- Remove links:
    - old: `I love this castle http://castle.com`
    - new: `I love this castle`

- Remove attached links (`text - link`) together with the text that introduces them:
    - old: `Check this out - http://link.com`
    - new: `None`

- Discard tweets that contain both happy and sad emoticons:
    - old: `I love this new music :) but there is no tourney soon :(`
    - new: `None`

- Add a space before punctuation, but only when it directly follows a letter:
    - old: `Hello!`
    - new: `Hello !`

from csv import reader
from os import getcwd
from os.path import join
from re import sub

def clean_data(data: str) -> str:
    """Apply the cleaning rules in order; return "" if the tweet must be discarded."""
    # Each entry is (regex, replacement). Order matters: every pattern
    # runs on the output of the previous one.
    patterns = [
        (r"@\w+", ""),                           # Remove mentions (usernames may contain "_")
        (r"#\w+", ""),                           # Remove hashtags
        (r"\bRT\b", ""),                         # Remove the retweet marker, but not "RT" inside a word
        (r".+ - https?://\S+", ""),              # Remove attached links and the text introducing them
        (r"https?://\S+", ""),                   # Remove remaining links
        (r".*[:;]\)[^\n]*[:;]\(.*|.*[:;]\([^\n]*[:;]\).*", ""),  # Drop text with both happy and sad emoticons
        (r"(?<=[a-zA-Z])[!?\".;,]", r" \g<0>"),  # Add a space before punctuation that follows a letter
        (r"[ \t]{2,}", " "),                     # Collapse runs of spaces left by the removals
        (r"^\s+|\s+$", ""),                      # Trim leading and trailing whitespace
    ]

    for pattern, replacement in patterns:
        # Nothing left to clean: the tweet will be discarded by the caller
        if data == "":
            break

        data = sub(pattern, replacement, data)

    return data

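if __name__ == "__main__":
    input_path = join(getcwd(), "inputs/testdata.manual.2009.06.14.csv")
    output_path = join(getcwd(), "output/cleaned_data.csv")

    # newline="" lets the csv reader handle line breaks inside quoted fields
    with open(input_path, "r", newline="") as f, open(output_path, "w") as o:
        cleaned_data: set = set()  # Cleaned texts already written, used to skip duplicates

        for row in reader(f):
            data: str = clean_data(row[5])  # Column index 5 is the text to clean

            # Skip tweets that are empty after cleaning or already written
            if data == "" or data in cleaned_data:
                continue

            cleaned_data.add(data)

            # Quote every field and double any embedded quotes (CSV escaping)
            fields = [*row[:5], data]
            o.write(",".join('"' + field.replace('"', '""') + '"' for field in fields) + "\n")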
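The manual quoting in the output step could also be delegated to the standard library. The sketch below assumes rows are lists of strings with the cleaned text in the last position. The function name `write_unique_rows` is hypothetical, and an in-memory buffer stands in for the output file.

```python
import csv
import io
from typing import Iterable, List, TextIO


def write_unique_rows(rows: Iterable[List[str]], out: TextIO) -> None:
    """Write rows whose last field (the cleaned text) is non-empty and not yet seen."""
    seen = set()
    # QUOTE_ALL quotes every field; embedded quotes are doubled automatically.
    writer = csv.writer(out, quoting=csv.QUOTE_ALL, lineterminator="\n")

    for row in rows:
        text = row[-1]
        if text == "" or text in seen:
            continue  # Discard empty and duplicate tweets
        seen.add(text)
        writer.writerow(row)


# Usage with an in-memory buffer instead of a real file
buffer = io.StringIO()
write_unique_rows(
    [
        ["4", "a", 'say "hi"'],
        ["0", "b", ""],          # Empty after cleaning: skipped
        ["2", "c", 'say "hi"'],  # Duplicate text: skipped
    ],
    buffer,
)
print(buffer.getvalue())
```

Using `csv.writer` keeps the escaping rules in one place, so fields other than the tweet text are also escaped correctly if they ever contain quotes or commas.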