## Assignment Applied Statistical Data Analysis Week One
### **22.10.2024**


**Import necessary packages and clean the word list**

First, we import the required packages. We use `re` to clean the text by removing unwanted characters such as digits and punctuation. `Counter` from the `collections` module counts how often each word occurs, and `pandas` organizes the results in a structured table (DataFrame).

In [18]:
# Necessary imports
import re
from collections import Counter
import pandas as pd

# Text input
text = "In academia and political debates, the notion of 'degrowth' has gained traction since the dawn of the 21st century. While some uncertainty around its exact definition remains, research on degrowth revolves around the idea of reducing resource and energy troughput as a unifying theme. We employ a mixed-methods desgin to systematically review the scientific peer-viewed english literature from 2008 to 20122 that refers 'to' degrowth or 'post-growth' in title, keywords or abstract (N=951)."

# Step 1: Remove everything except letters and whitespace (digits and punctuation are dropped),
# convert to lowercase, and split the text into words
cleaned_text = re.sub(r'[^a-zA-Z\s]', '', text).lower()
word_list = cleaned_text.split()
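
To see what the cleaning step keeps and drops, the same pattern can be applied to a short sample. This is a minimal sketch: the `sample` string and the variable names are made up for illustration and are not part of the assignment text. Digits and hyphens are deleted rather than replaced by spaces, so `21st` becomes `st` and `peer-viewed` becomes `peerviewed`.

```python
import re

# Hypothetical sample string, not part of the assignment text
sample = "The 21st century: peer-viewed 'degrowth' (N=951)."

# Same pattern as above: delete everything except letters and whitespace
cleaned = re.sub(r'[^a-zA-Z\s]', '', sample).lower()
words = cleaned.split()

print(words)
```

If hyphenated words should stay separate tokens, one option would be to replace the unwanted characters with a space instead of an empty string; that would change the word list used in the rest of this notebook.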

**Counting the words and determining word length**

Once the text is cleaned, we use `Counter` to count how many times each word appears in the list.
For each distinct word, we also record its length in characters.
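
As a small illustration of this step, the sketch below counts a hypothetical mini word list (`sample_words` is an assumption, not the assignment data) and reports the length of each distinct word.

```python
from collections import Counter

# Hypothetical mini word list for illustration
sample_words = ['the', 'notion', 'of', 'degrowth', 'the', 'degrowth', 'the']

# Counter maps each distinct word to the number of times it occurs
counts = Counter(sample_words)

# Iterating over a Counter visits each distinct word once, in first-seen order
for word, n in counts.items():
    print(word, len(word), n)
```

Because the loop runs over the distinct words, every word ends up as exactly one row in the final table, regardless of how often it occurs.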

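**Classifying the type of each word**

The script then assigns each word a type (preposition, pronoun, verb, noun, adjective, or adverb). Words are first looked up in fixed lists of prepositions and pronouns. All remaining words are matched against common prefixes and suffixes for verbs, nouns, and adjectives, in that order, and words that are still unclassified and end in "ly" are labeled adverbs. The first match wins, and words that match nothing stay `unknown`. This is a rough heuristic rather than a real part-of-speech tagger: `degrowth`, for example, is labeled a verb only because it starts with `de`.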
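
The effect of the check order can be sketched with a reduced example. The function `guess_type` and the shortened affix tuples are hypothetical and only illustrate the idea; the full lists are defined in the cell below. The sketch relies on `str.startswith` and `str.endswith` accepting a tuple of candidates.

```python
# Reduced, hypothetical affix tuples for illustration only
verb_prefixes_demo = ('re', 'un', 'de')
noun_suffixes_demo = ('tion', 'ity')


def guess_type(word):
    """Return the first matching category; the order of the checks decides ties."""
    if word.startswith(verb_prefixes_demo):
        return 'verb'
    if word.endswith(noun_suffixes_demo):
        return 'noun'
    if word.endswith('ly'):
        return 'adverb'
    return 'unknown'


for w in ['degrowth', 'notion', 'reduction', 'systematically', 'dawn']:
    print(w, guess_type(w))
```

Here `reduction` comes out as a verb even though it ends in `tion`, because the verb prefix `re` is checked first.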


In [19]:
# Define lists for common prepositions and pronouns
prepositions = ['in', 'on', 'at', 'by', 'with', 'about', 'against', 'between', 'under', 'over', 'through', 'during', 'before', 'after']
pronouns = ['i', 'you', 'he', 'she', 'it', 'we', 'they', 'me', 'him', 'her', 'us', 'them', 'my', 'your', 'his', 'its', 'our', 'their']

# Define typical prefixes and suffixes for verbs
verb_prefixes = ['re', 'dis', 'over', 'un', 'mis', 'out', 'be', 'co', 'de', 'fore', 'inter', 'pre', 'sub', 'trans', 'under']
verb_suffixes = ['ise', 'ate', 'fy', 'en']

# Define typical prefixes and suffixes for nouns
noun_prefixes = ['anti', 'auto', 'bi', 'co', 'counter', 'dis', 'ex', 'hyper', 'in', 'inter', 'kilo', 'mal', 'mega', 'mis', 'mini', 'mono', 'neo', 'out', 'poly', 'pseudo', 're', 'semi', 'sub', 'super', 'sur', 'tele', 'tri', 'ultra', 'under', 'vice']
noun_suffixes = ['tion', 'sion', 'er', 'ment', 'ant', 'ent', 'age', 'al', 'ence', 'ance', 'ery', 'ry', 'ship', 'ism', 'ity', 'ness', 'cy']

# Define typical prefixes and suffixes for adjectives
adjective_prefixes = ['un', 'im', 'in', 'ir', 'il', 'non', 'dis']
adjective_suffixes = ['al', 'ent', 'ive', 'ous', 'ful', 'less', 'able']


# Step 2: Count the frequency of words in the list
word_count = Counter(word_list)

# Initialize an empty list to store word info
word_info = []

# Step 3: Loop through each word and gather the necessary information
for word in word_count:
    # Determine the length of the word
    length = len(word)

    # Identify the word type (noun, verb, adjective, adverb, preposition, pronoun) based on its prefix or suffix
    word_type = 'unknown'
    
    if word in prepositions:
        word_type = 'preposition'
    elif word in pronouns:
        word_type = 'pronoun'
    else:
        # Check for verb prefixes or suffixes
        for prefix in verb_prefixes:
            if word.startswith(prefix):
                word_type = 'verb'
                break
        for suffix in verb_suffixes:
            if word.endswith(suffix):
                word_type = 'verb'
                break

        # Check for noun prefixes or suffixes if not already classified
        if word_type == 'unknown':
            for prefix in noun_prefixes:
                if word.startswith(prefix):
                    word_type = 'noun'
                    break
            for suffix in noun_suffixes:
                if word.endswith(suffix):
                    word_type = 'noun'
                    break

        # Check for adjective prefixes or suffixes if not already classified
        if word_type == 'unknown':
            for prefix in adjective_prefixes:
                if word.startswith(prefix):
                    word_type = 'adjective'
                    break
            for suffix in adjective_suffixes:
                if word.endswith(suffix):
                    word_type = 'adjective'
                    break

        # Check if it ends with "ly" (common for adverbs)
        if word_type == 'unknown' and word.endswith('ly'):
            word_type = 'adverb'

    # Append the information as a dictionary to the word_info list
    word_info.append({
        'word': word,
        'length': length,
        'count': word_count[word],
        'type': word_type
    })

# Step 4: Create a DataFrame from the word_info list
df = pd.DataFrame(word_info)

# Display the DataFrame
print(df)


              word  length  count         type
0               in       2      2  preposition
1         academia       8      1      unknown
2              and       3      2      unknown
3        political       9      1         noun
4          debates       7      1         verb
5              the       3      5      unknown
6           notion       6      1         noun
7               of       2      3      unknown
8         degrowth       8      3         verb
9              has       3      1      unknown
10          gained       6      1      unknown
11        traction       8      1         noun
12           since       5      1      unknown
13            dawn       4      1      unknown
14              st       2      1      unknown
15         century       7      1         noun
16           while       5      1      unknown
17            some       4      1      unknown
18     uncertainty      11      1         verb
19          around       6      2      unknown
20           