# **Analyzing Geospatial Data Using Different NLP Techniques (IMDb Summary Dataset - Cities)**

### The listed libraries and modules are essential tools for data analysis, visualization, natural language processing, and machine learning, enabling efficient handling, processing, and interpretation of textual and numerical data.

1. **pandas** (`pd`): For data manipulation and analysis, especially DataFrames.  
2. **re**: For text pattern matching using regular expressions.  
3. **nltk**: For natural language processing tasks like tokenization and stopword removal.  
4. **numpy** (`np`): For numerical computations and array handling.  
5. **seaborn** (`sns`): For statistical data visualization.  
6. **matplotlib.pyplot** (`plt`): For creating basic plots and charts.  
7. **plotly.express** (`px`): For creating interactive visualizations.  
8. **folium**: For creating interactive maps.  
9. **nltk.tokenize.word_tokenize**: To split text into individual words.  
10. **nltk.corpus.stopwords**: Provides a collection of common stopwords for text filtering.  
11. **nltk.util.ngrams**: For generating n-grams (sequences of n items) from text.  
12. **collections.Counter**: For counting occurrences of elements in a list.  
13. **fuzzywuzzy.fuzz**: For measuring text similarity.  
14. **nltk.sentiment.SentimentIntensityAnalyzer**: For sentiment analysis using VADER.  
15. **folium.plugins.MarkerCluster**: For grouping map markers into clusters.  
16. **geopy.geocoders.Nominatim**: For geocoding locations (converting addresses to coordinates).

In [None]:
import pandas as pd
import re
import nltk
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
import folium
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.util import ngrams
from collections import Counter
from fuzzywuzzy import fuzz
from nltk.sentiment import SentimentIntensityAnalyzer
from folium.plugins import MarkerCluster
from geopy.geocoders import Nominatim



### 1. **Data Import and Cleaning**:
   - Two CSV files are loaded into DataFrames (`df1` and `df2`).
   - Unnecessary columns are removed from both DataFrames (`Unnamed: 0`, `spacy_extracted_locations`, and `region` from `df1`; `spacy_extracted_locations` from `df2`).
   - These DataFrames are merged using `pd.concat()` into a new DataFrame (`merged_df`), and rows with missing values in the `city` or `country` columns are dropped.
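As a rough illustration of the steps above, the sketch below runs the same drop, concatenate, and filter sequence on two tiny made-up frames. The values and the names `toy1`, `toy2`, and `toy_merged` are assumptions for illustration, not the project data. It shows that `pd.concat` keeps the union of the columns and fills the gaps with missing values.

```python
import pandas as pd

# Two made-up frames with partly different columns
toy1 = pd.DataFrame({
    "film_id": ["tt1", "tt2"],
    "city": ["Jerusalem", None],
    "country": ["Israel", "France"],
    "region": ["Jerusalem District", None],
})
toy2 = pd.DataFrame({
    "film_id": ["tt3"],
    "city": ["Paris"],
    "country": ["France"],
    "neighbourhood": [None],
})

# Drop a column that is not needed, then stack the frames row-wise
toy1 = toy1.drop(columns=["region"])
toy_merged = pd.concat([toy1, toy2], ignore_index=True)

# Keep only rows that have both a city and a country
toy_merged = toy_merged.dropna(subset=["city", "country"])
print(toy_merged)
```

Columns that exist in only one input (here `neighbourhood`) are kept and left empty for rows from the other input, which is why the merged frame in this notebook contains many empty cells.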

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
pip install fuzzywuzzy

Collecting fuzzywuzzy
  Downloading fuzzywuzzy-0.18.0-py2.py3-none-any.whl.metadata (4.9 kB)
Downloading fuzzywuzzy-0.18.0-py2.py3-none-any.whl (18 kB)
Installing collected packages: fuzzywuzzy
Successfully installed fuzzywuzzy-0.18.0


In [None]:
df1 = pd.read_csv('/content/drive/MyDrive/Final Project /LATEST DATASET /all_new_cleaned.csv')

In [None]:
df2 = pd.read_csv('/content/drive/MyDrive/Final Project /LATEST DATASET /location_details_finalsummary21k.xls')

In [None]:
df1.head()

Unnamed: 0.1,Unnamed: 0,film_id,final_summary,spacy_extracted_locations,city,region,country
0,0,tt1140100,In marzipan city excitable young foodloving ch...,marzipan city,Jerusalem,Jerusalem District,Israel
1,8,tt1142433,Ryden malby graduates from college and is forc...,"browning, brazil, new york, happerman",Browning; ; New York;,Illinois; ; New York;,United States; Brazil; United States;
2,9,tt11426228,When a halloween store opens in a deserted str...,"alec, zoldrana, stripmall",Alec; ; Querétaro,Amolatar; ; Querétaro,Uganda; ; Mexico
3,10,tt11426232,When the infamous sweet sixteen killer returns...,"vernon, canada",Vernon;,Normandy;,France; Canada
4,11,tt11426562,In the winter of 19421943 a series of notsowel...,rzhev,Rzhev,Tver Oblast,Russia


In [None]:
df1 = df1.drop(columns=['Unnamed: 0', 'spacy_extracted_locations','region'])

In [None]:
df2.head()

Unnamed: 0.1,Unnamed: 0,film_id,final_summary,spacy_extracted_locations,neighbourhood,city,region,country,geocoding_success
0,0,tt0000147,Documentary film depicting the 1897 boxing mat...,carson city nevada,,Carson City,Nevada,United States,success
1,1,tt0002101,The fabled queen of egypts affair with roman g...,egypts,,Oslo,,Norway,success
2,2,tt0002423,The story of madame dubarry the mistress of lo...,france,,,,France,success
3,3,tt0003037,In part two of louis feuillades 5 12hour epic ...,paris,,Paris,Ile-de-France,France,success
4,4,tt0003419,Balduin a student of prague leaves his royster...,prague,,Capital City of Prague,Prague,Czechia,success


In [None]:
df2 = df2.drop(columns=['spacy_extracted_locations' ])

In [None]:
df2.head()

Unnamed: 0.1,Unnamed: 0,film_id,final_summary,neighbourhood,city,region,country,geocoding_success
0,0,tt0000147,Documentary film depicting the 1897 boxing mat...,,Carson City,Nevada,United States,success
1,1,tt0002101,The fabled queen of egypts affair with roman g...,,Oslo,,Norway,success
2,2,tt0002423,The story of madame dubarry the mistress of lo...,,,,France,success
3,3,tt0003037,In part two of louis feuillades 5 12hour epic ...,,Paris,Ile-de-France,France,success
4,4,tt0003419,Balduin a student of prague leaves his royster...,,Capital City of Prague,Prague,Czechia,success


In [None]:
merged_df = pd.concat([df1, df2], ignore_index=True)

In [None]:
merged_df.head()

Unnamed: 0.1,film_id,final_summary,city,country,Unnamed: 0,neighbourhood,region,geocoding_success
0,tt1140100,In marzipan city excitable young foodloving ch...,Jerusalem,Israel,,,,
1,tt1142433,Ryden malby graduates from college and is forc...,Browning; ; New York;,United States; Brazil; United States;,,,,
2,tt11426228,When a halloween store opens in a deserted str...,Alec; ; Querétaro,Uganda; ; Mexico,,,,
3,tt11426232,When the infamous sweet sixteen killer returns...,Vernon;,France; Canada,,,,
4,tt11426562,In the winter of 19421943 a series of notsowel...,Rzhev,Russia,,,,


In [None]:
merged_df = merged_df.dropna(subset=['city', 'country'])
merged_df.head()

Unnamed: 0.1,film_id,final_summary,city,country,Unnamed: 0,neighbourhood,region,geocoding_success
0,tt1140100,In marzipan city excitable young foodloving ch...,Jerusalem,Israel,,,,
1,tt1142433,Ryden malby graduates from college and is forc...,Browning; ; New York;,United States; Brazil; United States;,,,,
2,tt11426228,When a halloween store opens in a deserted str...,Alec; ; Querétaro,Uganda; ; Mexico,,,,
3,tt11426232,When the infamous sweet sixteen killer returns...,Vernon;,France; Canada,,,,
4,tt11426562,In the winter of 19421943 a series of notsowel...,Rzhev,Russia,,,,


In [None]:
merged_df.shape

(22734, 8)

## 2. **Exploding Lists in Columns**:
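   - The script splits the `city` and `country` columns on commas or semicolons and "explodes" the resulting lists so that each value sits on its own row. The exploded values are re-indexed and assigned back by position, so they are not guaranteed to stay aligned with their original `film_id` (for example, `Brazil` ends up next to `tt11426228` in the output further down).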
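For comparison, here is a minimal sketch of a split-and-explode that keeps every value attached to its own film. It assumes a single delimited column; `explode_keep_rows` and the toy frame are hypothetical names and data, not part of the notebook.

```python
import pandas as pd

def explode_keep_rows(df, column):
    """Split `column` on ';' or ',' and give each non-empty part its own row."""
    out = df.copy()
    out[column] = out[column].str.split(r'[;,]')
    # DataFrame.explode repeats the other columns for every list element
    out = out.explode(column).reset_index(drop=True)
    out[column] = out[column].str.strip()
    # Drop blanks left by patterns such as "a; ; b;"
    out = out[out[column].notna() & (out[column] != '')]
    return out.reset_index(drop=True)

toy = pd.DataFrame({
    'film_id': ['tt1', 'tt2'],
    'city': ['Browning; ; New York;', 'Rzhev'],
})
exploded = explode_keep_rows(toy, 'city')
print(exploded)
```

Because `DataFrame.explode` works on the whole frame, `film_id` is repeated automatically and no separate repair step is needed. Exploding `city` and `country` together would additionally require both lists to have the same length in every row.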

In [None]:
# Step 1: Define the function to split columns
def explode_column(df, column):
    return df[column].str.split(r'[;,]', expand=False).explode().reset_index(drop=True)

# Step 2: Apply the function to all relevant columns
merged_df['city'] = explode_column(merged_df, 'city')
merged_df['country'] = explode_column(merged_df, 'country')

# Step 3: Handle repeating the film_id and final_summary
max_len = merged_df.groupby('film_id').size().max()

merged_df['film_id'] = merged_df.groupby('film_id')['film_id'].transform('first')
merged_df['final_summary'] = merged_df.groupby('film_id')['final_summary'].transform('first')

# Step 4: Reset the index for a clean DataFrame
merged_df.reset_index(drop=True, inplace=True)

## 3. **Handling Repeated Data**:
   - The `film_id` and `final_summary` columns are adjusted to ensure they repeat properly to match the number of rows created after the "exploding" step.

In [None]:
# Assign the result so the columns are actually removed, then display the frame
merged_df = merged_df.drop(columns=['Unnamed: 0', 'neighbourhood', 'region','geocoding_success'])
merged_df

Unnamed: 0,film_id,final_summary,city,country
0,tt1140100,In marzipan city excitable young foodloving ch...,Jerusalem,Israel
1,tt1142433,Ryden malby graduates from college and is forc...,Browning,United States
2,tt11426228,When a halloween store opens in a deserted str...,,Brazil
3,tt11426232,When the infamous sweet sixteen killer returns...,New York,United States
4,tt11426562,In the winter of 19421943 a series of notsowel...,,
...,...,...,...,...
22729,tt1139800,Bertrand beauvois a wellknown attorney is in m...,,United States
22730,tt1139805,The veterinary assistant ulla have taken her j...,Boston,United States
22731,tt11398152,An immersion into the rich landscapes of sable...,,
22732,tt11398388,Welcome to riotsville a fictional town built b...,,United Kingdom


## **Text Preprocessing**:
### Objectives:
This step cleans the summaries of stopwords, punctuation, and unwanted phrases so that the unigram, bigram, and trigram analysis runs on low-noise text. The aim is to extract meaningful patterns and insights from the dataset of summaries.

### Approaches:
- **Text Preprocessing:** Clean text data by tokenizing, removing punctuation, and converting to lowercase.
- **Stopword Removal:** Eliminate standard stopwords and a custom list of unwanted unigrams, bigrams, and trigrams.
- **Token Analysis:** Use NLTK for unigram, bigram, and trigram extraction while excluding specified unwanted combinations.
- **Output Integration:** Combine processed tokens back into the DataFrame for streamlined analysis.
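The n-gram step above can be sketched on toy data as follows. This version counts bigrams one summary at a time, since flattening all summaries into a single token list would also count pairs that straddle two different summaries. `ngrams_of` is a hypothetical pure-Python stand-in for `nltk.util.ngrams`, and the documents and unwanted pairs are made up.

```python
from collections import Counter

def ngrams_of(tokens, n):
    """Return the n-grams of one token list as tuples."""
    return list(zip(*(tokens[i:] for i in range(n))))

# Made-up, already cleaned token lists (one per summary)
docs = [
    ["detective", "arrives", "new", "york"],
    ["paris", "detective", "arrives"],
]
unwanted = {("new", "york")}  # a set gives fast membership tests

bigram_counts = Counter()
for tokens in docs:
    # Count within one summary only, skipping unwanted pairs
    bigram_counts.update(bg for bg in ngrams_of(tokens, 2) if bg not in unwanted)

print(bigram_counts.most_common(3))
```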

In [None]:
nltk.download('punkt_tab')
nltk.download('punkt')
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

# Define unwanted unigrams, bigrams, and trigrams
unwanted_unigrams = [
    'donot', 'iam', 'know', 'itis', 'oh', 'right', 'come', 'like', 'youare', 'thatis',
    'got', 'yes', 'yeah', 'good', 'want', 'think', 'iwill', 'going', 'look', 'man', 'shewill',
    'okay', 'time', 'heis', 'tell', 'hey', 'gonna', 'mr', 'sir', 'didnot', 'ihave', 'wewill',
    'little', 'way', 'let', 'thank', 'love', 'mean', 'weare', 'sorry', 'letis', 'sheis', 'heis',
    'thereis', 'need', 'sure', 'said', 'whatis', 'night', 'people', 'help', 'l', 'itwill',
    'j', 'beeping', 'sheis', 'away', 'aboutthe', 'yeah', 'hellos', 'willdo', 'everthing',
    'sometimes', 'ohyeah', 'ugh', 'oy',  'shitty', 'ones', 'flowersyouare', 'upfor', 'ymca', 'guess', 'maybe', 'needs', 'wants',
    'youcould', 'theywill', 'possibly', 'usually', 'herthe', 'youto', 'backby', 'tum', 'wasquite',
    'looks', 'knowif', 'wehave', 'evidenceyou', 'wonot', 'says', 'n', 'downto', 'u', 'k',
    'ugh', 'oy','ones', 'flowersyouare','hello', 'hi', 'hey', 'greetings', 'goodmorning', 'goodafternoon', 'goodnight', 'howdy',
    'welcome', 'morning', 'evening', 'night', 'please', 'thanks', 'much', 'please', 'excuse',
    'okay', 'alright', 'maybe', 'number', 'ten', 'twenty', 'thirty', 'one', 'two', 'three', 'four',
    'five', 'six', 'seven', 'eight', 'nine', 'hundred', 'thousand', 'million', 'billion', 'first',
    'second', 'third', 'next', 'often', 'usually', 'sometimes', 'rarely', 'possibly', 'likely',
    'definitely', 'almost', 'pretty', 'absolutely', 'perhaps', 'likely', 'uncertain', 'hewill',
]

unwanted_bigrams = [
    ('donot', 'know'), ('iam', 'sorry'), ('donot', 'want'), ('iam', 'gonna'), ('iam', 'going'),
    ('oh', 'god'), ('donot', 'think'), ('ihave', 'got'), ('yes', 'sir'), ('come', 'come'),
    ('thatis', 'right'), ('donot', 'worry'), ('yeah', 'yeah'), ('good', 'night'), ('iam', 'sure'),
    ('youhave', 'got'), ('youare', 'gonna'), ('wait', 'minute'), ('oh', 'yeah'), ('ihad', 'like'),
    ('hey', 'hey'), ('know', 'iam'), ('oh', 'yes'), ('youare', 'going'), ('yes', 'yes'),
    ('good', 'morning'), ('know', 'itis'), ('donot', 'like'), ('oh', 'oh'), ('looks', 'like'),
    ('know', 'donot'), ('know', 'youare'), ('itis', 'like'), ('right', 'right'), ('whatis', 'matter'),
    ('weare', 'going'), ('ha', 'ha'), ('iam', 'afraid'), ('weare', 'gonna'), ('know', 'know'),
    ('okay', 'okay'), ('itis', 'okay'), ('wehave', 'got'), ('new', 'york'), ('didnot', 'know'),
    ('think', 'itis'), ('itis', 'good'), ('iwill', 'tell'), ('look', 'like'), ('itis', 'right')
]

unwanted_trigrams = [
    ('ha', 'ha', 'ha'), ('dialogue', 'default', 'pos'), ('come', 'come', 'come'), ('hey', 'hey', 'hey'),
    ('iam', 'sorry', 'iam'), ('donot', 'know', 'donot'), ('know', 'donot', 'know'), ('good', 'night', 'good'),
    ('oh', 'iam', 'sorry'), ('night', 'good', 'night'), ('yeah', 'yeah', 'yeah'), ('oh', 'god', 'oh'),
    ('donot', 'know', 'iam'), ('la', 'la', 'la'), ('oh', 'oh', 'oh'), ('sorry', 'iam', 'sorry'),
    ('wait', 'wait', 'wait'), ('donot', 'know', 'itis'), ('font', 'font', 'color'), ('font', 'color', 'ffffff'),
    ('donot', 'know', 'youare'), ('yes', 'yes', 'yes'), ('long', 'time', 'ago'), ('l', 'donot', 'know'),
    ('god', 'oh', 'god'), ('iam', 'sorry', 'itis'), ('whoa', 'whoa', 'whoa'), ('itis', 'okay', 'itis'),
    ('morning', 'good', 'morning'), ('good', 'morning', 'good'), ('okay', 'okay', 'okay'),
    ('oh', 'donot', 'know'), ('donot', 'know', 'know'), ('iam', 'sorry', 'didnot'), ('iam', 'sorry', 'donot'),
    ('help', 'help', 'help'), ('okay', 'itis', 'okay'), ('stop', 'stop', 'stop'), ('donot', 'think', 'itis'),
    ('oh', 'yes', 'yes'), ('right', 'right', 'right'), ('donot', 'worry', 'iwill'), ('donot', 'know', 'maybe'),
    ('oh', 'yeah', 'yeah'), ('wait', 'minute', 'wait'), ('aye', 'aye', 'sir'), ('yeah', 'thatis', 'right'),
    ('spanspan', 'style', 'style'), ('thank', 'youare', 'welcome'), ('donot', 'know', 'think')
]

# Function for text preprocessing (lowercase, punctuation removal, etc.)
def preprocess_text(text):
    # Tokenize the text into words using NLTK's word_tokenize
    tokens = nltk.word_tokenize(text.lower())
    tokens = [word for word in tokens if word.isalpha()]  # Only keep alphabetic words
    return tokens

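# Function to remove stopwords from tokens
unwanted_unigram_set = set(unwanted_unigrams)  # set lookup is faster and ignores duplicate entries

def remove_stopwords(tokens):
    return [word for word in tokens if word not in stop_words and word not in unwanted_unigram_set]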

merged_df['final_summary_Cleaned'] = merged_df['final_summary'].apply(preprocess_text)

merged_df['final_summary_No_Stopwords'] = merged_df['final_summary_Cleaned'].apply(remove_stopwords)

# Combine all tokens from the 'final_summary_No_Stopwords' column for unigram, bigram, and trigram analysis
all_tokens = [token for tokens in merged_df['final_summary_No_Stopwords'] for token in tokens]

# Count unigrams, bigrams, and trigrams
unigram_counts = Counter(all_tokens)
bigram_counts = Counter(ngrams(all_tokens, 2))
trigram_counts = Counter(ngrams(all_tokens, 3))

# Remove unwanted bigrams and trigrams
bigram_counts = {k: v for k, v in bigram_counts.items() if k not in unwanted_bigrams}
trigram_counts = {k: v for k, v in trigram_counts.items() if k not in unwanted_trigrams}

merged_df['final_summary_No_Stopwords'] = merged_df['final_summary_No_Stopwords'].apply(lambda x: ' '.join(x))

print(merged_df[['film_id', 'final_summary', 'final_summary_No_Stopwords']].head())

## **Geolocation Extraction**:
### Objectives:
The objective is to identify and extract geographical references (city and country names) within textual summaries, mapping their positions in the text for further analysis. This aids in understanding geographic mentions and their context in the dataset.

### Approaches:
- **Data Preparation:** Use the cleaned and tokenized text data from the `final_summary_No_Stopwords` column.
- **Geographic Matching:** Iterate through city and country names in each row, matching their occurrences in the text using case-insensitive pattern matching.
- **Index Extraction:** Capture the start and end indices of each match, creating a list of tuples to represent geographic references.
- **Filtering Results:** Retain only rows where geographic references are identified for further analysis.
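A minimal sketch of the matching and index extraction described above, using only the standard `re` module. It adds two things as assumptions on top of the description: blank or missing location names are skipped, and word boundaries are used so that a name is not matched inside a longer word. `find_location_spans` is a hypothetical helper name.

```python
import re

def find_location_spans(text, locations):
    """Return (start, end) character spans of each location name in `text`."""
    spans = []
    lowered = text.lower()
    for loc in locations:
        # Skip missing values and blank strings
        if not isinstance(loc, str) or not loc.strip():
            continue
        # \b assumes the name starts and ends with a letter or digit
        pattern = r"\b" + re.escape(loc.strip().lower()) + r"\b"
        spans.extend((m.start(), m.end()) for m in re.finditer(pattern, lowered))
    return spans

spans = find_location_spans(
    "student prague leaves comparison paris",
    ["Paris", "", None, " Prague"],
)
print(spans)
```

Here `paris` inside `comparison` is ignored, and the empty string does not produce a match at every position, which a bare `re.finditer("", text)` would.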

In [None]:
df = pd.DataFrame(merged_df)
def find_geo_indices(row):
    text = row['final_summary_No_Stopwords']  # The text to search in
    locations = [row['city'], row['country']]  # Geographic references
    geo_indices = []

    # Loop through each location and find its indices
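    for loc in locations:
        # Skip missing values and blank strings left over from splitting;
        # an empty pattern would otherwise match at every position in the text
        if pd.isna(loc) or not str(loc).strip():
            continue
        pattern = re.escape(str(loc).strip().lower())
        for match in re.finditer(pattern, text.lower()):
            geo_indices.append((match.start(), match.end()))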

    return geo_indices

df['geo_indices'] = df.apply(find_geo_indices, axis=1)
df_with_geo_indices = df[df['geo_indices'].map(len) > 0]
print(df_with_geo_indices[['film_id', 'final_summary_No_Stopwords', 'geo_indices']])

## **Identifying Rows with Missing Geographic Indices Using Exact and Fuzzy Matching**

**Objective:**  
The code identifies rows in a DataFrame (`df`) where geographic locations (city and country) are not found in the `final_summary_No_Stopwords` column. It uses a combination of exact and fuzzy matching to locate geographical indices in the text and flag those entries that lack any matches.

**Approaches:**
1. **Exact Match**: It checks if the `city` or `country` appears directly in the `final_summary_No_Stopwords` column using regular expressions.
2. **Fuzzy Match**: If no exact match is found, it applies fuzzy matching (using the `fuzz.partial_ratio` method) to detect similar terms with a threshold of 80% match.
3. **Filter Rows**: After applying `find_geo_indices_fuzzy`, the code selects the rows that still have no geographic indices and prints them for review.
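`fuzz.partial_ratio` returns a similarity score but not the position of the match. If a position is needed, one option is to slide a window over the tokens and score each window. The sketch below does this with the standard library's `difflib.SequenceMatcher` as a stand-in for fuzzywuzzy, so its scores will not be identical to `partial_ratio`; `best_fuzzy_window` and the 0.8 threshold are assumptions for illustration.

```python
from difflib import SequenceMatcher

def best_fuzzy_window(text, location, threshold=0.8):
    """Return (start_token, end_token, score) of the closest token window, or None."""
    tokens = text.lower().split()
    target = location.lower().split()
    width = len(target)
    if width == 0 or len(tokens) < width:
        return None
    target_str = " ".join(target)
    best = None
    for i in range(len(tokens) - width + 1):
        window = " ".join(tokens[i:i + width])
        score = SequenceMatcher(None, window, target_str).ratio()
        if best is None or score > best[2]:
            best = (i, i + width, score)
    # Report the best window only if it is similar enough
    return best if best[2] >= threshold else None

hit = best_fuzzy_window("queen egypts affair roman general", "Egypt")
miss = best_fuzzy_window("queen egypts affair roman general", "Norway")
print(hit, miss)
```

The returned positions are token indices, so they can be used directly to slice a token list.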


In [None]:
empty_geo_indices = df[df['geo_indices'].map(len) == 0]

In [None]:
def find_geo_indices_fuzzy(row):
    text = row['final_summary_No_Stopwords'].lower()
    geo_indices = []

    for loc in (row['city'], row['country']):
        # Skip missing values and blank strings
        if pd.isna(loc) or not str(loc).strip():
            continue
        loc = str(loc).strip().lower()

        # Exact match first
        exact = [(m.start(), m.end()) for m in re.finditer(re.escape(loc), text)]
        if exact:
            geo_indices.extend(exact)
        # Fuzzy match only if no exact match was found
        elif fuzz.partial_ratio(loc, text) > 80:  # Example threshold
            # partial_ratio gives a score, not a position, so record a
            # placeholder span; negative starts are skipped by extract_contexts
            geo_indices.append((-1, -1))

    return geo_indices


In [None]:
df['geo_indices'] = df.apply(find_geo_indices_fuzzy, axis=1)
empty_geo_indices = df[df['geo_indices'].map(len) == 0]

# Check unmatched rows
print(empty_geo_indices[['film_id', 'final_summary_No_Stopwords', 'city', 'country']])

## **Extracting Context Around Geographic Locations Using Tokenized Windows**
**Objective:**  
The code aims to extract contextual information surrounding geographic locations mentioned in the `final_summary_No_Stopwords` column by capturing a defined window of words before and after each location. This helps in understanding the context of geographic references in the text.

**Approaches:**
1. **Tokenization**: The text is tokenized into words using `word_tokenize`.
2. **Context Window**: For each geographic index (start and end positions), a window of words before and after the location is extracted, ensuring the indices stay within bounds.
3. **Filter and Apply**: The function `extract_contexts` is applied to rows with valid geographic indices, creating columns for "Context Before" and "Context After" for each location.
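The spans stored in `geo_indices` come from `re.finditer`, so they are character offsets, while the context window is measured in tokens. A minimal sketch of the conversion, assuming the text is a string of tokens joined by single spaces (as produced by `' '.join(...)`); `context_window` is a hypothetical helper name.

```python
def context_window(text, start, end, window_size=10):
    """Return (before, after) token lists around the character span [start, end)."""
    tokens = text.split(" ")
    first = text[:start].count(" ")      # index of the token containing `start`
    last = text[:end].count(" ") + 1     # one past the token containing `end - 1`
    before = tokens[max(0, first - window_size):first]
    after = tokens[last:last + window_size]
    return before, after

text = "balduin student prague leaves roystering companions"
start = text.index("prague")
before, after = context_window(text, start, start + len("prague"), window_size=2)
print(before, after)
```

Counting spaces also handles multi-word names such as `new york`, because the end offset falls inside the last word of the name.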

In [None]:
# Define the context window size (number of words before and after the location)
window_size = 10  # Number of tokens to include before and after the location reference

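def extract_contexts(text, geo_indices, window_size):
    tokens = word_tokenize(text)

    context_before = []
    context_after = []

    if not geo_indices:
        return context_before, context_after

    # For each location index range in geo_indices, extract context
    for start, end in geo_indices:
        # geo_indices hold character offsets from re.finditer; skip invalid spans
        if start < 0 or end <= start or end > len(text):
            continue

        # Convert character offsets to token positions. The text is a string of
        # tokens joined by single spaces, so counting spaces gives the token index
        # (this assumes word_tokenize splits this pre-cleaned text on spaces only).
        token_start = text[:start].count(' ')
        token_end = text[:end].count(' ') + 1

        # Extract the context window for the given location
        window_start = max(0, token_start - window_size)
        window_end = min(len(tokens), token_end + window_size)

        # Extract context before and after location mention
        context_before.append(" ".join(tokens[window_start:token_start]))
        context_after.append(" ".join(tokens[token_end:window_end]))

    return context_before, context_after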

# Copy the filtered rows so new columns can be added without a SettingWithCopyWarning
df_non_empty_geo = df[df['geo_indices'].apply(lambda x: len(x) > 0)].copy()

df_non_empty_geo['Context Before'], df_non_empty_geo['Context After'] = zip(*df_non_empty_geo.apply(
    lambda row: extract_contexts(row['final_summary_No_Stopwords'], row['geo_indices'], window_size), axis=1
))

print(df_non_empty_geo[['film_id', 'Context Before', 'Context After']])

## **Calculating Word Frequency in Contexts Surrounding Geographic Locations**

**Objective:**  
The goal is to calculate the frequency of meaningful words in the context surrounding geographic locations (city and country) mentioned in the text. This helps to identify important words and patterns associated with each location.

**Approaches:**
1. **Context Extraction**: Combines the "Context Before" and "Context After" for each location.
2. **Tokenization & Filtering**: Tokenizes the context into words, filtering out stop words and non-alphabetic words.
3. **Word Frequency Calculation**: Counts the frequency of the remaining meaningful words using the `Counter` class.
4. **Storage**: Stores the calculated word frequencies for each location in a dictionary, displaying the results for each city.
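The counting steps above can be sketched on toy data as follows. The stop list, the rows, and the name `freq_by_city` are made up for illustration. The sketch accumulates counts per city, so a city that appears in several rows keeps the words from all of them.

```python
from collections import Counter, defaultdict

STOP = {"the", "a", "in", "of"}  # tiny illustrative stop list, not NLTK's

# (city, list of context strings) pairs standing in for DataFrame rows
rows = [
    ("Paris", ["the artist arrives in", "of a lost painting"]),
    ("Paris", ["artist falls in love"]),
    ("Oslo", ["a quiet winter"]),
]

freq_by_city = defaultdict(Counter)
for city, contexts in rows:
    for context in contexts:
        # Keep alphabetic, non-stopword tokens
        words = [w for w in context.lower().split() if w.isalpha() and w not in STOP]
        # Counter.update adds to existing counts instead of replacing them
        freq_by_city[city].update(words)

for city, freq in freq_by_city.items():
    print(city, dict(freq))
```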

In [None]:
dfgeo = pd.DataFrame(df_non_empty_geo)
def calculate_frequency(contexts):
    stop_words = set(stopwords.words('english'))

    all_words = []
    for context in contexts:
        words = word_tokenize(context)
        filtered_words = [word.lower() for word in words if word.isalpha() and word.lower() not in stop_words]
        all_words.extend(filtered_words)

    word_freq = Counter(all_words)
    return word_freq

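location_frequencies = {}

for index, row in dfgeo.iterrows():
    city = row['city']

    contexts = row['Context Before'] + row['Context After']
    word_freq = calculate_frequency(contexts)
    # Add to any counts already stored for this city instead of overwriting them
    location_frequencies[city] = dict(Counter(location_frequencies.get(city, {})) + word_freq)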

for city, word_freq in location_frequencies.items():
    print(f'"{city}" -> {word_freq}')

## **Sentiment Analysis of Top Positive and Negative Words in Location Contexts**

**Objective:**  
The goal is to analyze the sentiment of words in the context surrounding geographic locations (cities) and identify the top positive and negative words based on sentiment scores. This helps to understand the emotional tone associated with each location.

**Approaches:**
1. **Sentiment Analysis**: Uses the `SentimentIntensityAnalyzer` from the VADER lexicon to assign sentiment scores to words.
2. **Word Frequency Calculation**: Tokenizes the context surrounding each location, filters out stop words and non-alphabetic words, and calculates the frequency of meaningful words.
3. **Top Word Identification**: For each location, identifies the top N positive and negative words based on their sentiment scores.
4. **Display Results**: Displays the top positive and negative words for each location along with their sentiment scores.
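The top-word selection in step 3 can be sketched independently of VADER. The scores below are placeholder values chosen for illustration, not output of `SentimentIntensityAnalyzer`, and `top_sentiment_words` is a hypothetical helper name.

```python
def top_sentiment_words(scores, top_n):
    """Split a word -> score mapping into the strongest positive and negative words."""
    positive = sorted(
        (item for item in scores.items() if item[1] > 0),
        key=lambda kv: kv[1],
        reverse=True,   # most positive first
    )
    negative = sorted(
        (item for item in scores.items() if item[1] < 0),
        key=lambda kv: kv[1],   # most negative first
    )
    return positive[:top_n], negative[:top_n]

# Placeholder scores; neutral words (score 0) fall into neither list
toy_scores = {"love": 0.64, "murder": -0.69, "hope": 0.44, "war": -0.60, "city": 0.0}
top_pos, top_neg = top_sentiment_words(toy_scores, 1)
print(top_pos, top_neg)
```

Sorting the two groups separately means each list starts with its strongest word. Scoring single words in isolation ignores negation and context, so the result describes vocabulary rather than the tone of whole sentences.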

In [None]:
from nltk.sentiment import SentimentIntensityAnalyzer
nltk.download('vader_lexicon')
sia = SentimentIntensityAnalyzer()

def calculate_frequency(contexts):
    stop_words = set(stopwords.words('english'))
    all_words = []

    for context in contexts:
        words = word_tokenize(context)
        filtered_words = [word.lower() for word in words if word.isalpha() and word.lower() not in stop_words]
        all_words.extend(filtered_words)

    word_freq = Counter(all_words)
    return word_freq

TOP_N = 30

# Function to analyze sentiment and identify top positive/negative words
def analyze_sentiment(word_freq_dict, top_n):
    results = {}

    for location, words in word_freq_dict.items():
        sentiment_scores = {}

        # Assign sentiment scores to each word
        for word, freq in words.items():
            sentiment_score = sia.polarity_scores(word)['compound']
            sentiment_scores[word] = sentiment_score

        # Sort words by sentiment score (descending for positive, ascending for negative)
        sorted_words = sorted(sentiment_scores.items(), key=lambda x: x[1], reverse=True)
        positive_words = [(word, score) for word, score in sorted_words if score > 0]
        negative_words = [(word, score) for word, score in sorted_words if score < 0]

        # Select top N positive and negative words
        top_positive = positive_words[:top_n]
        top_negative = negative_words[-top_n:]

        # Store results for this location
        results[location] = {
            "Top Positive Words": top_positive,
            "Top Negative Words": top_negative
        }

    return results

# Step 1: Calculate word frequency for each location
location_frequencies = {}
for index, row in dfgeo.iterrows():
    city = row['city']

    # Combine Context Before and Context After into one list of sentences
    contexts = row['Context Before'] + row['Context After']

    # Calculate word frequency for this location
    word_freq = calculate_frequency(contexts)

    # Add to any counts already stored for this city instead of overwriting them
    location_frequencies[city] = dict(Counter(location_frequencies.get(city, {})) + word_freq)

# Step 2: Perform sentiment analysis on the word frequencies
sentiment_results = analyze_sentiment(location_frequencies, TOP_N)

# Step 3: Print the results
for location, sentiment_data in sentiment_results.items():
    print(f"Location: {location}")
    print("Top Positive Words:")
    for word, score in sentiment_data["Top Positive Words"]:
        print(f"  {word}: {score:.2f}")
    print("Top Negative Words:")
    for word, score in sentiment_data["Top Negative Words"]:
        print(f"  {word}: {score:.2f}")
    print("-" * 50)

# Step 4: Save the results to a DataFrame
# Collect one dict per location and build the DataFrame once
rows = []
for location, sentiment_data in sentiment_results.items():
    rows.append({
        "Location": location,
        "Top Positive Words": ", ".join(word for word, _ in sentiment_data["Top Positive Words"]),
        "Top Negative Words": ", ".join(word for word, _ in sentiment_data["Top Negative Words"]),
    })

sentiment_df = pd.DataFrame(rows, columns=["Location", "Top Positive Words", "Top Negative Words"])

print(sentiment_df)
sentiment_df.to_csv("sentiment_results.csv", index=False)

In [None]:
sentiment_df = sentiment_df.dropna(subset=['Location', 'Top Positive Words', 'Top Negative Words'], how='all')
sentiment_df = sentiment_df[(sentiment_df['Location'] != "") &
                        ((sentiment_df['Top Positive Words'] != "") | (sentiment_df['Top Negative Words'] != ""))]

## **Thematic Classification of Text Data by Location**
**Objective:** Use semantic analysis to classify words into thematic categories, enabling the identification of thematic patterns related to specific geographic locations.

#### **Approach:**
1. **Text Preprocessing:** The summary text is cleaned and tokenized by removing stopwords and non-alphabetic characters. This allows us to focus on meaningful words related to specific themes.
2. **Word Frequency Calculation:** For each location, the frequency of meaningful words is calculated. This provides insight into the importance of various words in the context of each location.
3. **Theme Classification:** Based on predefined themes (e.g., "Religious," "Recovery," "Tourism"), words are classified into thematic categories using their frequency and relevance. Each location is then assigned words that correspond to its dominant themes.
4. **Result:** The output is a categorized list of themes for each geographic location, highlighting which thematic topics are most prominent in the summaries associated with that location.
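A minimal sketch of the classification step on toy data. The theme lists in this notebook include multi-word keywords such as "holy site", which a lookup over single tokens cannot hit, so this version also counts bigrams. `toy_themes` and `theme_counts` are hypothetical names, and the keyword lists are shortened for illustration.

```python
from collections import Counter

toy_themes = {
    "Religious": ["church", "holy site"],
    "Crime": ["murder", "crime scene"],
}

def theme_counts(tokens, themes):
    """Count how often each theme's keywords occur as single words or word pairs."""
    unigrams = Counter(tokens)
    bigrams = Counter(" ".join(pair) for pair in zip(tokens, tokens[1:]))
    counts = {}
    for theme, keywords in themes.items():
        # A Counter returns 0 for missing keys, so absent keywords add nothing
        total = sum(unigrams[k] + bigrams[k] for k in keywords)
        if total:
            counts[theme] = total
    return counts

tokens = "detective finds murder victim crime scene near church".split()
result = theme_counts(tokens, toy_themes)
print(result)
```

Keywords longer than two words would need trigram counts added in the same way.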

In [None]:
# Predefined themes and associated words
themes = {
    "Religious": [
        "cathedral", "church", "religion", "prayer", "bible", "mosque", "temple", "altar", "worship",
        "holy", "faith", "spiritual", "god", "christianity", "islam", "buddhism", "sacred", "pilgrimage",
        "holy site", "theology", "devotion", "soul", "divine", "holy scripture", "saint", "sanctuary",
        "hymn", "ritual", "monastery", "clergy", "meditation", "holy water", "sacrament", "deity",
        "parish", "divinity"
    ],
    "Recovery": [
        "fire", "astonishing", "reopens", "reborn", "renewed", "resilience", "healing", "regeneration",
        "recover", "restoration", "rebuild", "rehabilitation", "survive", "revival", "revitalize",
        "recovery", "reconstruction", "renew", "restoration", "comeback", "rejuvenation", "hope",
        "overcome", "perseverance", "adaptation", "healing process", "rebuild society"
    ],
    "Tourism": [
        "visit", "favourite", "explore", "attraction", "sightseeing", "vacation", "tourist", "destination",
        "landmark", "holiday", "trip", "journey", "museum", "adventure", "staycation", "resort", "beach",
        "scenic", "tour", "excursion", "vacay", "backpacking", "tourist guide", "cruise", "holidaymakers",
        "postcards", "souvenirs", "travel", "itinerary", "travel photography"
    ],
    "Crime": [
        "robbery", "murder", "theft", "burglary", "assault", "violence", "crime scene", "investigation",
        "detective", "forensics", "criminal", "guilty", "conviction", "suspect", "witness", "court",
        "trial", "injustice", "gang", "violence", "corruption", "smuggling", "drug trade", "miscarriage of justice"
    ],
    "History": [
        "heritage", "monument", "ancient", "legacy", "artifact", "archaeology", "museum", "timeline",
        "historical", "medieval", "empire", "revolution", "battle", "kingdom", "dynasty", "civilization",
        "ancestor", "ruins", "past", "chronicle", "period", "fossil", "archaeological site", "historian",
        "ancient ruins", "past events"
    ],
    "Love": [
        "romance", "affection", "heart", "relationship", "kiss", "love story", "passion", "couple","love",
        "date", "devotion", "intimacy", "commitment", "flirtation", "affectionate", "beloved", "fond","care",
        "adoration", "emotions", "together", "soulmate", "infatuation", "longing", "courtship", "fondness", "hug",
        "marriage", "flirt", "girlfriend", "boyfriend", "adore", "in love", "passionate", "emotional", "ring"
    ],
    "Friendship": [
        "companion", "friend", "loyalty", "support", "bond", "togetherness", "trust", "sharing", "help",
        "side by side", "adventure", "solidarity", "brotherhood", "sisterhood", "camaraderie", "confidant",
        "laughter", "fellowship",  "joy", "companionship", "mutual support", "close friendship",
        "friendship goals"
    ],
    "Family": [
        "family reunion", "household", "generation", "relatives", "grandparents",
        "family gathering", "heritage", "ancestor", "tradition", "bloodline",
        "upbringing", "family bond", "nurturing", "clan", "generation gap", "kin",
        "descendant", "family tree", "maternal", "paternal", "roots",
        "family values", "guardianship", "parenting", "foster", "adoption",
        "lineage", "brotherhood", "sisterhood", "inheritance", "domestic life",
        "togetherness", "bonding", "protection", "family ties", "unity",
        "homemaking", "parental guidance", "household dynamics", "family tradition",
        "progeny", "offspring", "matriarch", "patriarch", "guardian", "blood relatives",
        "extended family", "close-knit", "family legacy", "family portrait",
    ],

    "War": [
        "battle", "army", "soldier", "warfare", "conflict", "violence", "fight", "military", "weapon",
        "troops", "combat", "strategy", "victory", "defeat", "revolution", "soldiers", "assault",
        "frontline", "invasion", "siege", "peacekeeping", "guerrilla", "trench", "military conflict",
        "hostilities"
    ],
    "Nature": [
        "forest", "river", "mountain", "ocean", "wildlife", "ecosystem", "earth", "climate",
        "nature reserve", "green", "conservation", "landscape", "biodiversity", "environment", "flora",
        "fauna", "sunrise", "sunset", "rainforest", "desert", "beach", "nature trail", "ecology",
        "wilderness", "protected areas", "sustainability"
    ],
    "Politics": [
        "election", "government", "democracy", "republic", "politician", "policy", "candidate", "debate",
        "vote", "congress", "parliament", "president", "law", "legislation", "rights", "campaign",
        "corruption", "bureaucracy", "political", "activism", "public opinion", "governmental",
        "political party", "lobbying"
    ],
    "Science": [
        "research", "discovery", "experiment", "innovation", "technology", "theory", "study", "knowledge",
        "scientist", "lab", "breakthrough", "genetics", "biology", "chemistry", "physics", "space",
        "robotics", "medicine", "cure", "evolution", "laboratory", "technological advancement",
        "scientific community"
    ],
    "Fantasy": [
        "magic", "dragon", "wizard", "fantasy", "sorcery", "enchantment", "wizardry", "mythical",
        "fairy tale", "kingdom", "elf", "dwarf", "monster", "quest", "supernatural", "sword", "warrior",
        "witch", "spell", "legend", "magician", "mystical", "adventure", "hero", "enchanted", "myth"
    ],
    "Sex & Nudity": [
        "sex", "intimacy", "seduction", "pleasure", "erotic", "sexual", "passion", "romantic", "desire",
        "affection", "lust", "sensual", "love making", "temptation", "attraction", "orgasm", "flirt",
        "provocative", "seductive", "sexuality", "explicit", "nudity", "bare", "exposure", "undressed", "skin",
        "topless", "bottomless", "stripping", "nude",  "sensuality", "eroticism", "peep show",
        "body", "lacking clothes", "bare skin", "nude art", "unclothed", "clothing removed", "disrobing"
    ]
}

# Function to calculate meaningful word frequency
def calculate_frequency(contexts):
    stop_words = set(stopwords.words('english'))
    all_words = []

    for context in contexts:
        words = word_tokenize(context)
        filtered_words = [word.lower() for word in words if word.isalpha() and word.lower() not in stop_words]
        all_words.extend(filtered_words)
    word_freq = Counter(all_words)
    return word_freq


# Function to classify words by theme and calculate frequencies, with sorting by word frequency
def classify_themes_for_location(location, words, themes):
    theme_count = {theme: Counter() for theme in themes}
    for word in words:
        for theme, keywords in themes.items():
            if word in keywords:
                theme_count[theme][word] += 1

    # Sort themes by the total frequency of words associated with them
    sorted_theme_count = sorted(theme_count.items(), key=lambda x: sum(x[1].values()), reverse=True)

    print(f"Location: {location}")
    for theme, word_count in sorted_theme_count:
        if word_count:
            theme_words = ", ".join([f"{word} ({count})" for word, count in word_count.items()])
            print(f"  {theme}: {theme_words}")
    print("-" * 50)

    for theme, word_count in sorted_theme_count:
        if word_count:
            for word, count in word_count.items():
                results.append({'Location': location, 'Theme': theme, 'Word': word, 'Count': count})

results = []

for location, words in location_frequencies.items():
    classify_themes_for_location(location, words, themes)

results_df = pd.DataFrame(results)

## **Theme Classification Using BERT-Based Word Embeddings for Geographical References in the IMDb Summary Dataset**

**Purpose:**
The goal of this code is to classify words associated with locations into thematic categories using pre-trained BERT embeddings. It processes location data, computes BERT embeddings for both location-related keywords and thematic keywords, and calculates cosine similarities to assign words to relevant themes. The output is a DataFrame containing location-theme-word associations with their frequencies.

**Approach:**
1. **Load Pre-trained BERT Model:**
   - The BERT tokenizer and model (`bert-base-uncased`) are loaded using the `transformers` library.
   - The model is set to evaluation mode to avoid updating model parameters during inference.

2. **Precompute Embeddings:**
   - For each theme and its corresponding keywords, embeddings are precomputed using BERT.
   - Embeddings for the words in the location data (`location_frequencies`) are computed on the fly during classification, one call to `get_embedding` per word.

3. **Classify Words into Themes:**
   - The function `classify_themes_with_bert` takes location data and associated words, compares the cosine similarity between the word embeddings and theme embeddings, and assigns words to themes if the similarity score exceeds a threshold (defaulted to 0.7).

4. **Cosine Similarity Calculation:**
   - The cosine similarity between word embeddings and theme embeddings is used to quantify the relationship between a word and a theme.

5. **Store and Output Results:**
   - The classified data, including location, theme, word, and its frequency, is stored in a list and converted into a pandas DataFrame (`dfgeo`).
   - The results are saved to a CSV file (`location_themes_with_bert.csv`) for further analysis.
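
The thresholding step in the approach above can be sketched without loading BERT. The sketch below is only an illustration: the three-dimensional vectors, the two themes, and the helper names `cosine` and `match_themes` are hypothetical stand-ins for the 768-dimensional embeddings and the `themes` dictionary used in this notebook.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy "embeddings"; real BERT vectors have 768 dimensions.
theme_vectors = {
    "War": {"battle": [0.9, 0.1, 0.0], "army": [0.8, 0.2, 0.1]},
    "Nature": {"forest": [0.0, 0.9, 0.3], "river": [0.1, 0.8, 0.4]},
}
word_vectors = {"combat": [0.85, 0.15, 0.05], "lake": [0.05, 0.85, 0.35]}

def match_themes(word_vec, theme_vectors, threshold=0.7):
    """Return (theme, keyword, similarity) triples at or above the threshold."""
    matches = []
    for theme, keywords in theme_vectors.items():
        for keyword, kw_vec in keywords.items():
            sim = cosine(word_vec, kw_vec)
            if sim >= threshold:
                matches.append((theme, keyword, round(sim, 3)))
    return matches

combat_matches = match_themes(word_vectors["combat"], theme_vectors)
lake_matches = match_themes(word_vectors["lake"], theme_vectors)
print(combat_matches)
print(lake_matches)
```

With real embeddings it may be worth inspecting the distribution of similarity scores before settling on a threshold, since a cutoff that separates these toy vectors cleanly is not guaranteed to do the same for `[CLS]` vectors.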



In [None]:
import torch
from transformers import BertTokenizer, BertModel
from sklearn.metrics.pairwise import cosine_similarity

# Load BERT tokenizer and model
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')
model.eval()

# Function to get BERT embeddings for a word
def get_embedding(word):
    tokens = tokenizer(word, return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        outputs = model(**tokens)
    # Use the [CLS] token's embedding as the word representation
    return outputs.last_hidden_state[:, 0, :].numpy()

# Precompute embeddings for all theme keywords
theme_embeddings = {
    theme: {keyword: get_embedding(keyword) for keyword in keywords}
    for theme, keywords in themes.items()
}

# Function to classify words by theme using BERT embeddings
def classify_themes_with_bert(location, words, theme_embeddings, threshold=0.7):
    theme_count = {theme: Counter() for theme in theme_embeddings}

    for word in words:
        word_embedding = get_embedding(word)
        for theme, keywords_embeddings in theme_embeddings.items():
            for keyword, keyword_embedding in keywords_embeddings.items():
                similarity = cosine_similarity(word_embedding, keyword_embedding)[0][0]
                if similarity >= threshold:
                    theme_count[theme][keyword] += 1

    theme_data = []
    for theme, word_count in theme_count.items():
        if word_count:
            for word, count in word_count.items():
                theme_data.append({
                    "location": location,
                    "theme": theme,
                    "word": word,
                    "count": count
                })

    return theme_data


all_theme_data = []
for location, words in location_frequencies.items():
    location_theme_data = classify_themes_with_bert(location, words, theme_embeddings)
    all_theme_data.extend(location_theme_data)

dfgeo = pd.DataFrame(all_theme_data)
print(dfgeo)

dfgeo.to_csv('location_themes_with_bert.csv', index=False)


## **BERT Model Evaluation**
**Approach Summary**:

1. **Extract Similarity Scores**:  
   - Compute cosine similarity between word embeddings of location-related words and thematic keywords.  
   - Filter scores where similarity is ≥ 0.7 and store them.

2. **One-Sample T-Test**:  
   - Use `ttest_1samp` to compare the mean of similarity scores against a neutral baseline (0.5).  

3. **Evaluate Significance**:  
   - Check if p-value < 0.05 to determine statistical significance.  
   - Conclude whether the similarity scores differ significantly from the baseline.
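
For reference, the statistic behind a one-sample t-test can be written out by hand. The sketch below uses only the standard library; the helper name `one_sample_t` and the five scores are hypothetical and only illustrate the formula t = (mean - baseline) / (s / sqrt(n)).

```python
import math
import statistics

def one_sample_t(sample, popmean):
    """Return the one-sample t statistic and its degrees of freedom."""
    n = len(sample)
    mean = statistics.mean(sample)
    sd = statistics.stdev(sample)  # sample standard deviation (n - 1 denominator)
    t_stat = (mean - popmean) / (sd / math.sqrt(n))
    return t_stat, n - 1

# Hypothetical similarity scores that already passed the >= 0.7 filter.
scores = [0.72, 0.81, 0.75, 0.90, 0.78]
t_stat, dof = one_sample_t(scores, 0.5)
print(f"t = {t_stat:.2f}, degrees of freedom = {dof}")
```

One consequence is visible in the toy data: when every retained score is at least 0.7, the sample mean sits above the 0.5 baseline by construction, so a significant result mostly reflects the filter. Running the test on the unfiltered scores would be one way to make the comparison more informative.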

In [None]:
from scipy.stats import ttest_1samp

# Extract similarity scores for matched themes
similarity_scores = []
for location, words in location_frequencies.items():
    for word in words:
        word_embedding = get_embedding(word)
        for theme, keywords_embeddings in theme_embeddings.items():
            for keyword, keyword_embedding in keywords_embeddings.items():
                similarity = cosine_similarity(word_embedding, keyword_embedding)[0][0]
                if similarity >= 0.7:
                    similarity_scores.append(similarity)

# Perform one-sample t-test
t_stat, p_value = ttest_1samp(similarity_scores, 0.5)  # Test against a neutral baseline
print(f"T-statistic: {t_stat}, P-value: {p_value}")
if p_value < 0.05:
    print("The similarity scores are statistically significant.")
else:
    print("The similarity scores are not statistically significant.")


In [None]:
def classify_themes_for_location(location, words, themes):
    theme_count = {theme: Counter() for theme in themes}  # Initialize empty counters for each theme

    # Check the structure of `themes` and `words`
    if not isinstance(themes, dict):
        print(f"Error: 'themes' should be a dictionary, but got {type(themes)}")
    if not isinstance(words, list):
        print(f"Error: 'words' should be a list, but got {type(words)}")

    # Classify each word in location's word list
    for word in words:
        for theme, keywords in themes.items():
            if isinstance(keywords, list):  # Ensure 'keywords' is a list
                if word in keywords:
                    theme_count[theme][word] += 1
            else:
                print(f"Error: Keywords for theme '{theme}' is not a list but a {type(keywords)}")

    # Sort themes by the total frequency of words associated with them
    sorted_theme_count = sorted(theme_count.items(), key=lambda x: sum(x[1].values()), reverse=True)

    # Print results for the location
    print(f"Location: {location}")
    for theme, word_count in sorted_theme_count:
        if word_count:  # Only print themes that have associated words
            theme_words = ", ".join([f"{word} ({count})" for word, count in word_count.items()])
            print(f"  {theme}: {theme_words}")
    print("-" * 50)

    # Store results in a structured way for DataFrame creation
    for theme, word_count in sorted_theme_count:
        if word_count:  # Only save themes that have associated words
            for word, count in word_count.items():
                results.append({'Location': location, 'Theme': theme, 'Word': word, 'Count': count})


In [None]:
theme_df = dfgeo.copy()

## **Focusing on Top 3 Countries for City-Level Analysis**

The analysis will focus on the top three countries with the most cities to streamline processing, given the total of 4,037 cities. These top countries are:

1. **United States** with 1,434 cities  
2. **United Kingdom** with 416 cities  
3. **India** with 363 cities  

The steps are as follows:  
- **Group by Country**: Aggregate the unique city names for each country.  
- **Filter by Country**: Extract the list of cities specific to the United States, United Kingdom, and India.  
- **Convert to Lists**: Transform the extracted city data into list formats for further use.  
- **Display Results**: Present the lists of cities for the United States, United Kingdom, and India.  

This targeted approach simplifies the analysis by focusing on the cities within the three countries with the highest representation in the dataset.

In [None]:
num_unique_cities = df['city'].nunique()
print(f"Number of unique cities: {num_unique_cities}")

## **Analysis of the Top Countries with the Most Cities and Unique City Counts**

**Objective:**
The objective of this analysis is to identify and visualize the top countries with the most cities, based on the unique city count, and provide insights on the number of cities per country.

**Approaches:**
1. **Data Cleaning:** The dataset was cleaned by removing any null values and extra whitespace from the country and city columns.
2. **City Count per Country:** The number of unique cities per country was calculated using the `groupby` method, and the results were sorted in descending order.
3. **Top Countries Visualization:** The top 3 countries with the most unique cities were extracted and visualized using a bar plot.
4. **Additional Insights:** A breakdown of the unique cities in the top countries was also displayed in the results.
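
The cleaning and counting steps can be illustrated on a handful of rows without pandas. This is a minimal sketch under stated assumptions: the `rows` list and the helper `count_unique_cities` are hypothetical, and a set per country stands in for `drop_duplicates` followed by `groupby(...).nunique()`.

```python
from collections import defaultdict

# Hypothetical (country, city) rows; the notebook reads these from `merged_df`.
rows = [
    ("United States ", "Chicago"),
    ("United States", " Chicago"),              # duplicate once whitespace is stripped
    ("United States: Illinois", "Springfield"),  # text after ':' is discarded
    ("India", "Mumbai"),
    ("India", "Pune"),
    ("United Kingdom", "London"),
    (None, "Atlantis"),                          # missing country, skipped
    ("", "Nowhere"),                             # empty country, skipped
]

def count_unique_cities(rows):
    """Return (country, unique city count) pairs sorted by count, descending."""
    cities_by_country = defaultdict(set)
    for country, city in rows:
        if not country or not city:
            continue
        country = country.strip().split(":")[0].strip()
        city = city.strip()
        if country and city:
            cities_by_country[country].add(city)
    counts = {country: len(cities) for country, cities in cities_by_country.items()}
    return sorted(counts.items(), key=lambda item: item[1], reverse=True)

top_3 = count_unique_cities(rows)[:3]
print(top_3)
```

Using a set per country means repeated mentions of the same city cannot inflate a country's total.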

In [None]:
df_city = merged_df.copy()

In [None]:
# Step 1: Drop rows with missing values, then strip extra spaces and unexpected characters from country and city names
df_city = df_city.dropna(subset=['country', 'city']).copy()
df_city['country'] = df_city['country'].str.strip().str.split(':').str[0]  # Skip anything after ':'

# Remove rows where country is empty or contains just the colon
df_city = df_city[df_city['country'] != ''].copy()

df_city['city'] = df_city['city'].str.strip()

# Step 2: Remove duplicate city-country combinations (no duplicate cities within the same country)
df_unique = df_city.drop_duplicates(subset=["country", "city"])

# Step 3: Group by country and count the unique cities per country
country_city_counts = df_unique.groupby("country")["city"].nunique().reset_index(name="city_count")

# Step 4: Sort by city count and display the top 3 countries
top_3_countries = country_city_counts.sort_values(by="city_count", ascending=False).head(3)

# Step 5: Print results
for _, row in top_3_countries.iterrows():
    print(f"{row['country']}: {row['city_count']} cities")


In [None]:
# US cities list
us_cities = [
    "Browning", "New York", "Corsicana", "Washington", "Nashville-Davidson", "Los Angeles",
    "Kansas City", "Kenosha", "Detroit", "Albuquerque", "Chicago", "New Orleans", "Ruston",
    "Madras", "Chicago", "San Francisco", "Glendale", "Miami", "Houston", "Austin", "Seattle",
    "Cleveland", "Oakland", "San Diego", "Las Vegas", "Brooklyn Park", "El Paso", "Anaheim",
    "Philadelphia", "Dallas", "Atlanta", "Cincinnati", "Minneapolis", "Tulsa", "Bismarck",
    "St. Louis", "Baltimore", "Mesa", "Memphis", "San Antonio", "Charlotte", "San Jose",
    "Jacksonville", "Newark", "Portland", "Miami Beach", "Fairbanks", "Seattle", "Pittsburgh",
    "Mansfield", "Palm Springs", "Phoenix", "Eden", "Litchfield", "Columbia", "Tampa", "Salem",
    "Baltimore", "Chicago", "Los Angeles", "Boston", "Bismarck", "Dubuque", "Des Moines", "Omaha",
    "Philadelphia", "Columbus", "Indianapolis", "Elk Lick Township", "Dallas", "Reno", "Fort Collins",
    "Doffing", "Beverly Hills", "Washington", "Knoxville", "Cincinnati", "Memphis", "Fresno",
    "Denver", "San Antonio", "Salt Lake City", "Fort Worth", "Indianapolis", "Las Vegas", "Chicago"
]

# UK cities list
uk_cities = [
    "London", "Manchester", "Bristol", "Glasgow", "Edinburgh", "Birmingham", "Leeds", "Liverpool",
    "Cambridge", "Oxford", "Cardiff", "Sheffield", "York", "Leicester", "Newcastle", "Nottingham",
    "Southampton", "Exeter", "Coventry", "Brighton", "Reading", "Derby", "Stoke-on-Trent",
    "Sunderland", "Loughborough", "Worcester", "Luton", "Basingstoke", "Milton Keynes"
]

# Indian cities list
indian_cities = [
    "Mumbai", "New Delhi", "Bangalore", "Hyderabad", "Chennai", "Kolkata", "Pune", "Ahmedabad",
    "Jaipur", "Lucknow", "Kanpur", "Nagpur", "Indore", "Patna", "Vadodara", "Surat", "Chandigarh",
    "Bhopal", "Vijayawada", "Kochi", "Coimbatore", "Visakhapatnam", "Madurai", "Ranchi", "Agra",
    "Faridabad", "Noida", "Ghaziabad", "Gurugram", "Meerut", "Jammu", "Raipur", "Shimla", "Tirunelveli",
    "Jabalpur", "Mangalore", "Dibrugarh", "Udaipur", "Gwalior", "Puducherry", "Dehradun"
]


In [None]:
dfgeo = dfgeo.drop(columns=['Unnamed: 0', 'neighbourhood', 'region', 'geocoding_success', 'final_summary', 'final_summary_Cleaned'])

## **Word Frequency Analysis and Thematic Classification for Cities in the UK, USA, and India**

**Objective:**  
The objective of this analysis is to calculate the frequency of meaningful words in city descriptions, classify them into thematic categories, and generate formatted output showing the prominence of different themes in cities across the UK, USA, and India.

**Approaches:**  
1. **Word Frequency Calculation:** The analysis processes the city descriptions, removing stopwords and non-alphabetic characters, and calculates the frequency of meaningful words for each location.
2. **Thematic Classification:** The calculated word frequencies are compared against predefined themes to classify each word into a specific theme.
3. **Filtering Dataset:** The dataset is filtered to focus on cities in the UK, USA, and India.
4. **Formatted Output Generation:** For each location, the analysis produces formatted output showing the thematic classification and frequency of related words.
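
The first two steps can be sketched on a single sentence. The small `STOP_WORDS` set and `THEMES` dictionary below are hypothetical stand-ins for NLTK's stopword list and the notebook's `themes` dictionary, and `theme_counts` is an illustrative helper. The sketch also counts adjacent word pairs, because a two-word keyword such as "love story" can never equal a single token.

```python
import re
from collections import Counter

# Tiny stand-ins for NLTK's stopword list and the notebook's `themes` dictionary.
STOP_WORDS = {"a", "an", "and", "in", "of", "the", "their", "to"}
THEMES = {
    "Love": ["romance", "love story", "couple"],
    "War": ["battle", "soldier"],
}

def theme_counts(text, themes, stop_words):
    """Count theme keywords in `text`, matching one- and two-word keywords."""
    tokens = [t for t in re.findall(r"[a-z]+", text.lower()) if t not in stop_words]
    unigrams = Counter(tokens)
    # Pairs of tokens that are adjacent after stopword removal.
    bigrams = Counter(" ".join(pair) for pair in zip(tokens, tokens[1:]))
    counts = {}
    for theme, keywords in themes.items():
        hits = Counter()
        for keyword in keywords:
            # Keywords longer than two words are not handled by this sketch.
            source = bigrams if " " in keyword else unigrams
            if source[keyword]:
                hits[keyword] = source[keyword]
        if hits:
            counts[theme] = hits
    return counts

summary = "A love story of a couple in London, and the battle to save their romance."
result = theme_counts(summary, THEMES, STOP_WORDS)
print(result)
```

Because the pairs are formed after stopwords are removed, two words separated only by a stopword in the original text would also count as adjacent; whether that is desirable depends on the keyword list.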

In [None]:
import pandas as pd
from collections import Counter
from nltk.corpus import stopwords

# Function to calculate word frequency
def calculate_frequency(contexts):
    all_words = []
    stop_words = set(stopwords.words('english'))

    for context in contexts:
        if isinstance(context, list):
            context = ' '.join(context)  # Convert list to string
        words = context.lower().split()
        all_words.extend([word for word in words if word.isalpha() and word not in stop_words])
    return Counter(all_words)

# Function to classify words by theme and generate formatted output
def classify_themes_for_location(location, word_freq, themes):
    theme_data = []

    for word, count in word_freq.items():
        for theme, keywords in themes.items():
            if word in keywords:
                theme_data.append({
                    'city': location[0],
                    'country': location[1],
                    'theme': theme,
                    'word': word,
                    'count': count
                })

    return theme_data

# Main function to process the dataset
def main(df_city, themes):
    # Filter the dataset for cities in UK, USA, and India
    filtered_df = df_city[df_city['country'].isin(['United Kingdom', 'United States', 'India'])]

    all_theme_data = []

    for index, row in filtered_df.iterrows():
        location = (row['city'], row['country'])
        context = row['final_summary_No_Stopwords']

        word_freq = calculate_frequency([context])
        theme_data = classify_themes_for_location(location, word_freq, themes)
        all_theme_data.extend(theme_data)

    # Create a new DataFrame with the theme data
    df_themed = pd.DataFrame(all_theme_data)

    return df_themed  # Return the new DataFrame with themes

# Assuming you have the 'themes' dictionary available
if __name__ == "__main__":
    df_themed = main(df_city, themes)  # Pass themes to the main function
    df_themed.to_csv("themed_results.csv", index=False)  # Save the new DataFrame to a CSV


In [None]:
# Create DataFrame
df = pd.DataFrame(df_themed)

# Group by location and theme, summing the count
grouped = df.groupby(['city', 'theme'], as_index=False)['count'].sum()

# Calculate total counts for each location
grouped['total_count'] = grouped.groupby('city')['count'].transform('sum')

# Calculate the percentage for each theme within a location
grouped['percentage'] = (grouped['count'] / grouped['total_count']) * 100

# Format the percentage column to 2 decimal places
grouped['percentage'] = grouped['percentage'].round(2)

# Display the result
print(grouped)


## **Top 3 Themes and Keywords for Cities in the US, UK, and India**

**Objective:**  
The goal of this analysis is to identify and display the top 3 themes and their corresponding top 3 keywords for cities in the United States, United Kingdom, and India. This allows for a thematic understanding of city data based on specific keywords.

**Approaches:**  
1. **Data Filtering:** The dataset is filtered to include only cities from the United States, United Kingdom, and India.
2. **Top 3 Themes per City:** For each city, the analysis identifies the top 3 most prominent themes based on theme frequency.
3. **Keyword Extraction:** For each theme, the top 3 most frequent keywords are extracted from the corresponding list of words. Keywords are split from any counts and processed accordingly.
4. **Results Storage and Display:** The top themes and their respective keywords for each city are stored in a dictionary and displayed in a structured format, showing the country, city, theme, and associated keywords.
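
The ranking logic can be sketched with `collections.Counter`. The rows and the helper `top_themes_and_keywords` below are hypothetical. The sketch ranks themes by their summed keyword counts, which is an assumption; ranking by the number of rows per theme is another reasonable choice and can give a different order.

```python
from collections import Counter, defaultdict

# Hypothetical rows shaped like `df_themed`: (city, theme, word, count).
rows = [
    ("London", "War", "battle", 5),
    ("London", "War", "army", 2),
    ("London", "Love", "romance", 4),
    ("London", "Nature", "river", 2),
    ("London", "Science", "lab", 1),
]

def top_themes_and_keywords(rows, n=3):
    """Map each city to its top-n themes, each with its top-n keywords."""
    per_city = defaultdict(lambda: defaultdict(Counter))
    for city, theme, word, count in rows:
        per_city[city][theme][word] += count
    summary = {}
    for city, themes in per_city.items():
        theme_totals = Counter({t: sum(words.values()) for t, words in themes.items()})
        summary[city] = [
            (theme, themes[theme].most_common(n))
            for theme, _ in theme_totals.most_common(n)
        ]
    return summary

summary = top_themes_and_keywords(rows)
for theme, keywords in summary["London"]:
    print(theme, keywords)
```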

In [None]:
# Step 1: Standardize column names
df_themed.columns = df_themed.columns.str.strip().str.lower()

# Verify that the required columns exist
required_columns = ['city', 'country', 'theme', 'word']
if not all(col in df_themed.columns for col in required_columns):
    raise ValueError(f"Missing one or more required columns: {required_columns}")

# Step 2: Filter the DataFrame for the countries of interest
countries = ['United States', 'United Kingdom', 'India']
filtered_df = df_themed[df_themed['country'].isin(countries)]

# Initialize a list to store results for each country, city, theme, and keyword
results = []

# Process each country separately
for country, country_data in filtered_df.groupby('country'):
    for city, city_data in country_data.groupby('city'):
        if pd.isna(city):
            continue

        # Get the top 3 themes for this city
        theme_counts = city_data['theme'].value_counts().head(3)

        for theme in theme_counts.index:
            theme_data = city_data[city_data['theme'] == theme]

            # Extract and count keywords
            all_keywords = []
            for words in theme_data['word'].dropna():
                all_keywords.extend([word.split(' (')[0].strip() for word in words.split(',')])

            # Get the top 3 keywords for this theme
            keyword_counts = pd.Series(all_keywords).value_counts().head(3)

            for keyword, count in keyword_counts.items():
                results.append({
                    'Country': country,
                    'City': city,
                    'Theme': theme,
                    'Keyword': keyword,
                    'Count': count
                })

# Create a DataFrame from the results
results_df = pd.DataFrame(results)

# Display the results
print(results_df)


## **Sentiment Analysis of City Themes Using VADER**

**Objective:**  
To analyze the sentiment of text data associated with city themes, categorizing it as Positive, Negative, or Neutral based on sentiment scores derived using VADER (Valence Aware Dictionary and sEntiment Reasoner).  

**Approaches Taken:**  
1. Loaded the city-theme results (`results_df`), which include a text column (`Keyword`) for analysis.  
2. Leveraged the NLTK library's VADER sentiment analyzer to compute compound sentiment scores for each text entry.  
3. Defined a function to classify sentiment into three categories—Positive, Negative, or Neutral—based on the computed compound scores.  
4. Augmented the dataset with sentiment scores and categories, allowing for deeper insights into the emotional tone of city themes.  
5. Saved the processed data to a CSV file for further analysis or reporting.  
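
The classification in step 3 comes down to where the cutoffs on the compound score are placed. The sketch below is an illustration under assumptions: `classify_compound` and the example scores are hypothetical, and the ±0.05 band is a commonly cited convention for VADER compound scores rather than the strict zero cutoff.

```python
def classify_compound(score, threshold=0.05):
    """Label a VADER compound score, treating a small band around zero as neutral."""
    if score >= threshold:
        return "Positive"
    if score <= -threshold:
        return "Negative"
    return "Neutral"

# Hypothetical compound scores, which VADER reports in the range [-1, 1].
examples = [0.64, 0.02, 0.0, -0.03, -0.51]
labels = [classify_compound(score) for score in examples]
print(list(zip(examples, labels)))
```

With a strict zero cutoff, the scores 0.02 and -0.03 would be labeled Positive and Negative instead of Neutral, so the choice of band changes how many keywords end up in each category.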

In [None]:
from nltk.sentiment import SentimentIntensityAnalyzer
nltk.download('vader_lexicon')
sia = SentimentIntensityAnalyzer()
print(results_df.head())

# Function to get sentiment score using VADER
def get_vader_score(text):
    score = sia.polarity_scores(text)
    return score['compound']

# Apply VADER sentiment analysis to the 'Keyword' column
results_df['sentiment_score_vader'] = results_df['Keyword'].apply(get_vader_score)

# Classify sentiment based on VADER scores
def classify_sentiment(score):
    if score > 0:
        return 'Positive'
    elif score < 0:
        return 'Negative'
    else:
        return 'Neutral'

results_df['sentiment_category_vader'] = results_df['sentiment_score_vader'].apply(classify_sentiment)
print(results_df[['Country', 'City', 'Theme', 'Keyword', 'sentiment_score_vader', 'sentiment_category_vader']].head())
results_df.to_csv('sentiment_analysis_results.csv', index=False)

## **VADER Evaluation**
**Approach Summary**:

1. **Count Sentiment Categories**:  
   - Use `value_counts` to get the frequency of each sentiment category in `results_df`.

2. **Calculate Expected Distribution**:  
   - Assume an equal distribution of sentiment categories for comparison.  

3. **Perform Chi-Square Test**:  
   - Use `chi2_contingency` to test if the observed and expected distributions differ significantly.  

4. **Evaluate Significance**:  
   - If p-value < 0.05, conclude that the distribution of sentiment categories is statistically significant.  
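
Comparing observed category counts with an equal split is a goodness-of-fit question, and its statistic can be computed directly. The sketch below is a hand-rolled illustration with hypothetical counts and a hypothetical helper `chi_square_gof`; in SciPy the corresponding function is `scipy.stats.chisquare`. `chi2_contingency` instead treats its input as a contingency table and tests independence between rows and columns, which is a different hypothesis.

```python
import math

def chi_square_gof(observed):
    """Chi-square goodness-of-fit statistic against a uniform expectation."""
    total = sum(observed)
    expected = total / len(observed)
    stat = sum((o - expected) ** 2 / expected for o in observed)
    return stat, len(observed) - 1

# Hypothetical counts for the Positive, Negative, and Neutral categories.
observed = [120, 30, 50]
stat, dof = chi_square_gof(observed)

# With exactly 2 degrees of freedom the chi-square survival function
# has the closed form exp(-x / 2); other cases need a statistics library.
p_value = math.exp(-stat / 2) if dof == 2 else None
print(f"chi2 = {stat:.2f}, dof = {dof}, p = {p_value}")
```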

In [None]:
from scipy.stats import chi2_contingency
import numpy as np

# Frequency counts of sentiment categories
category_counts = results_df['sentiment_category_vader'].value_counts()

# Expected distribution (equal probabilities)
total = sum(category_counts)
expected_counts = np.full(len(category_counts), total / len(category_counts))

# Perform Chi-Square Test
chi2_stat, p_value, _, _ = chi2_contingency([category_counts, expected_counts])

print(f"Chi-Square Statistic: {chi2_stat}, P-value: {p_value}")
if p_value < 0.05:
    print("The distribution of sentiment categories is statistically significant.")
else:
    print("The distribution of sentiment categories is not statistically significant.")

Chi-Square Statistic: 1942.776422250805, P-value: 0.0
The distribution of sentiment categories is statistically significant.


In [None]:
sentiment = results_df.copy()

In [None]:
filtered_df = sentiment[(sentiment['sentiment_score_vader'] != 0.0000) | (sentiment['sentiment_category_vader'] != 'Neutral')]


In [None]:
pattern = '^[a-zA-Z ]*$'
filtered_df = filtered_df[filtered_df['Country'].str.match(pattern, na=False) & filtered_df['City'].str.match(pattern, na=False)]

In [None]:
grouped_df = (
    filtered_df.groupby(['Country', 'City'])
    .agg({
        'Theme': lambda x: ', '.join(sorted(set(x))),  # Combine unique themes
        'Keyword': lambda x: ', '.join(sorted(set(x))),  # Combine unique keywords
        'sentiment_score_vader': ['sum'],  # Calculate total sentiment scores
        'sentiment_category_vader': lambda x: {
            'Positive': (x == 'Positive').sum(),
            'Negative': (x == 'Negative').sum()
        }
    })
    .reset_index()
)

# Flatten MultiIndex columns resulting from aggregation
grouped_df.columns = ['Country', 'City', 'Themes', 'Keywords', 'Total Sentiment Score', 'Sentiment Counts']

# Expand 'Sentiment Counts' dictionary into separate columns
sentiment_counts = pd.json_normalize(grouped_df['Sentiment Counts'])
grouped_df = pd.concat([grouped_df.drop(columns='Sentiment Counts'), sentiment_counts], axis=1)

# Rename columns for clarity
grouped_df.rename(columns={'Positive': 'Total Positive Sentiments', 'Negative': 'Total Negative Sentiments'}, inplace=True)

In [None]:
grouped_df.head()

### **Geographic Sentiment and Frequency Analysis of Locations in Texts**

### Objective:
This script analyzes the frequency and sentiment of geographic locations mentioned in text. The goal is to identify biases or patterns in how frequently locations are mentioned and the sentiment associated with those locations. Two key analyses are performed: frequency of mentions (intensity) and sentiment analysis (fairness).

### Approach:

1. **Frequency Calculation (Intensity)**:
   - The script calculates the frequency of mentions for each location by tokenizing the surrounding text and counting the occurrences of significant words, ignoring common stopwords (like 'the', 'a', etc.).
   - The frequency data is organized by location and includes the most frequently mentioned words associated with each location.

2. **Sentiment Analysis (Bias/Fairness)**:
   - For each location, sentiment scores are calculated using the **VADER SentimentIntensityAnalyzer** from NLTK. Positive and negative sentiment words are identified for each location.
   - An average sentiment score is calculated for each location to assess the overall sentiment (positive or negative) associated with it.

3. **Data Organization**:
   - The results of the frequency analysis and sentiment analysis are stored in two separate DataFrames:
     - `df_intensity`: Contains the frequency of mentions and top associated words for each location.
     - `df_bias`: Contains the average sentiment score, top positive words, and top negative words for each location.

4. **Visualization**:
   - **Frequency of Mentions (Intensity)**: A bar plot is created to visualize the top 20 locations based on the frequency of mentions.
   - **Sentiment Scores (Bias)**: A bar plot is created to visualize the top 20 locations based on the average sentiment score.

5. **Statistical Testing**:
   - A **Chi-squared test** is performed to determine if there is any significant association between the locations' mention frequency and the grouping of locations based on a predefined frequency category.


In [None]:
# Ensure NLTK resources are downloaded
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('vader_lexicon')


# Function to calculate frequency of mentions (Intensity)
def calculate_frequency(contexts):
    stop_words = set(stopwords.words('english'))
    all_words = []

    for context in contexts:
        # Ensure context is a string and not NaN or None
        if isinstance(context, str):
            words = word_tokenize(context)
            filtered_words = [word.lower() for word in words if word.isalpha() and word.lower() not in stop_words]
            all_words.extend(filtered_words)

    return Counter(all_words)

# Function to analyze sentiment and bias (Fairness)
def analyze_sentiment(word_freq_dict, top_n=5):
    sia = SentimentIntensityAnalyzer()
    sentiment_scores = {'positive': [], 'negative': []}

    for word, freq in word_freq_dict.most_common(top_n):
        score = sia.polarity_scores(word)
        if score['compound'] >= 0.05:
            sentiment_scores['positive'].append(word)
        elif score['compound'] <= -0.05:
            sentiment_scores['negative'].append(word)

    # Average the compound scores of the polar (non-neutral) words; 0 if there are none
    polar_words = sentiment_scores['positive'] + sentiment_scores['negative']
    if polar_words:
        avg_sentiment_score = sum(sia.polarity_scores(word)['compound'] for word in polar_words) / len(polar_words)
    else:
        avg_sentiment_score = 0
    return avg_sentiment_score, sentiment_scores

# Analyze frequency and sentiment for each city
# Note: a city that appears in several rows keeps only the values from its last row
location_frequencies = {}
sentiment_analysis = {}

for index, row in dfgeo.iterrows():
    city = row['city']
    contexts = str(row['Context Before']) + " " + str(row['Context After'])

    # Frequency calculation
    word_freq = calculate_frequency([contexts])
    location_frequencies[city] = word_freq

    # Sentiment analysis
    avg_sentiment_score, sentiment_scores = analyze_sentiment(word_freq, top_n=5)
    sentiment_analysis[city] = {
        'Average Sentiment Score': avg_sentiment_score,
        'Top Positive Words': sentiment_scores['positive'],
        'Top Negative Words': sentiment_scores['negative'],
    }

# Prepare the results for display
intensity_data = []
for country, word_freq in location_frequencies.items():
    top_words = ', '.join([word for word, _ in word_freq.most_common(3)])  # Top 3 associated words
    intensity_data.append([country, sum(word_freq.values()), top_words])

# Convert to DataFrame for intensity (Frequency of Mentions)
df_intensity = pd.DataFrame(intensity_data, columns=['Location', 'Frequency of Mentions', 'Top Associated Words'])

# Prepare the results for sentiment analysis (Bias of Sentiment)
bias_data = []
for country, sentiment_data in sentiment_analysis.items():
    bias_data.append([country, sentiment_data['Average Sentiment Score'], ', '.join(sentiment_data['Top Positive Words']), ', '.join(sentiment_data['Top Negative Words'])])

# Convert to DataFrame for sentiment (Bias of Sentiment)
df_bias = pd.DataFrame(bias_data, columns=['Location', 'Average Sentiment Score', 'Top Positive Words', 'Top Negative Words'])

# Display results
print("Intensity (Frequency of Mentions):")
print(df_intensity)

print("\nFairness (Bias of Sentiment):")
print(df_bias)

In [None]:
# Filter the top 20 locations based on frequency of mentions (intensity)
df_intensity_top20 = df_intensity.nlargest(20, 'Frequency of Mentions')

# Set up the matplotlib figure for intensity
plt.figure(figsize=(10, 6))

# Visualize the top 20 Frequency of Mentions (Intensity)
sns.barplot(x='Frequency of Mentions', y='Location', data=df_intensity_top20, palette='viridis')

plt.title('Top 20 Locations by Frequency of Mentions')
plt.xlabel('Frequency of Mentions')
plt.ylabel('Location')
plt.show()

# Filter the top 20 locations based on average sentiment score (bias)
df_bias_top20 = df_bias.nlargest(20, 'Average Sentiment Score')

# Set up the matplotlib figure for bias
plt.figure(figsize=(10, 6))

# Visualize the top 20 Average Sentiment Score (Bias)
sns.barplot(x='Average Sentiment Score', y='Location', data=df_bias_top20, palette='coolwarm')

plt.title('Top 20 Locations by Average Sentiment Score')
plt.xlabel('Average Sentiment Score')
plt.ylabel('Location')
plt.show()

In [None]:
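import scipy.stats as stats

# NOTE: this cell expects a 'Location Group' column, which df_intensity as built above does not have.
# Add one (a categorical label per location) before running; otherwise the lookup raises a KeyError.
contingency_table = pd.crosstab(df_intensity['Location Group'], df_intensity['Frequency of Mentions'])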

# Perform Chi-squared test
chi2_stat, p_value, dof, expected = stats.chi2_contingency(contingency_table)

# Output result
print(f"Chi-squared Stat: {chi2_stat}")
print(f"P-Value: {p_value}")
print(f"Degrees of Freedom: {dof}")
print(f"Expected Frequencies: \n{expected}")

## **Interactive Theme Map with Geolocated Clusters**  

#### **Purpose:**  
To create an interactive map showing how themes are distributed across cities in the United States, the United Kingdom, and India, using geospatial clustering. Users can explore which themes dominate in each region and which keywords are associated with them.

---

#### **Approaches:**

1. **Data Preparation and Cleaning:**  
   - Filtered the dataset to the selected countries and grouped it by `Theme`, keeping each row's city.  
   - Counted the keywords (`Keyword`) within each theme and kept the three most frequent.  
   - Flattened keyword lists so that they render properly in the visualization.

2. **Geolocation Mapping:**  
   - Used the `geopy` library (`Nominatim` geocoder) to fetch the latitude and longitude of each city from its name.  
   - Placed a marker on the map at each geocoded coordinate.  

3. **Theme-Based Marker Customization:**  
   - Defined a color for each theme in a dictionary, wrapped in the `get_theme_color` function, with grey as the fallback for unknown themes.  
   - Drew fixed-radius circular markers, color-coded by theme, for immediate differentiation.  

4. **Interactive Features:**  
   - A pop-up on each marker displays the theme, its top keywords, and the location.  
   - A tooltip on hover shows the theme and the row's keyword for quick identification.  

5. **Cluster Visualization:**  
   - Used Folium's `MarkerCluster` to group nearby markers dynamically. This reduces clutter and enhances map usability, particularly in regions with dense data points.  

6. **Custom Legend:**  
   - Added a fixed-position legend, styled with inline HTML and CSS, to explain the theme colors.  

7. **Map Styling and Aesthetics:**  
   - Chose the `CartoDB Positron` tile style for a clean and minimalistic background.  
   - Set the initial map center and zoom level (`location=[20, 0]`, `zoom_start=2`) to open on a world-level overview.  
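
The geolocation step issues one geocoder lookup per location name. Because many rows share a city, a small cache can keep the number of requests down. The sketch below is one way this could look and is not taken from the notebook: `geocode_unique` and `fake_geocode` are hypothetical names, and the stub stands in for a real geocoder callable such as `geolocator.geocode` so that the example runs offline.

```python
def geocode_unique(queries, geocode_fn):
    """Look up each distinct query once and return {query: result}.

    `geocode_fn` is any callable that takes a query string. Misses and
    failures are stored as None so that they are not retried.
    """
    cache = {}
    for query in queries:
        if query in cache:
            continue
        try:
            cache[query] = geocode_fn(query)
        except Exception as exc:  # e.g. network errors or timeouts
            print(f"Geocoding failed for {query!r}: {exc}")
            cache[query] = None
    return cache


# Hypothetical stand-in for a real geocoder; it records every call it receives
calls = []

def fake_geocode(query):
    calls.append(query)
    known = {
        "London, United Kingdom": (51.5074, -0.1278),
        "Mumbai, India": (19.0760, 72.8777),
    }
    return known.get(query)


rows = [
    "London, United Kingdom",
    "Mumbai, India",
    "London, United Kingdom",  # repeated: served from the cache
    "Atlantis, India",         # unknown: cached as None
]
coords = geocode_unique(rows, fake_geocode)
print(len(calls), "lookups for", len(rows), "rows")
```

With `geopy`, `geocode_fn` could be `geolocator.geocode` wrapped in `geopy.extra.rate_limiter.RateLimiter` with `min_delay_seconds=1`, which helps stay within Nominatim's limit of one request per second. Including the country in the query string also helps with ambiguous names such as Birmingham or Cambridge.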

---

This interactive map combines clustering and color-coded themes to provide an intuitive way to explore how themes are distributed across the mapped cities. The legend, pop-ups, and tooltips make the map easier to read and navigate.
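
Since the legend and the markers rely on the same theme-to-color mapping, the legend HTML can be generated from that mapping instead of being typed out by hand, which keeps the two from drifting apart. The following is a minimal sketch under stated assumptions: `theme_colors` is a shortened copy of the mapping, and `build_legend_html` is a hypothetical helper name, not part of Folium.

```python
from html import escape

# Shortened example mapping; the full notebook mapping has more themes
theme_colors = {
    "Family": "#0074D9",
    "Crime": "#FFA500",
    "Sex & Nudity": "#FF1493",
}


def build_legend_html(colors, title="Theme Legend"):
    """Return a fixed-position HTML legend with one color swatch per theme."""
    rows = "\n".join(
        f'    <div><i style="background: {color}; padding: 5px; border-radius: 5px;">'
        f'&nbsp;&nbsp;&nbsp;</i> {escape(theme)}</div>'
        for theme, color in colors.items()
    )
    return (
        '<div style="position: fixed; bottom: 50px; left: 50px; width: 250px; '
        'background-color: white; border: 2px solid grey; z-index: 9999; '
        'font-size: 14px; padding: 10px; border-radius: 10px;">\n'
        f'    <b>{escape(title)}:</b><br>\n'
        f'{rows}\n'
        '</div>'
    )


legend_html = build_legend_html(theme_colors)
print(legend_html)
```

The resulting string can be attached to the map in the same way as a hand-written one, via `world_map.get_root().html.add_child(folium.Element(legend_html))`. Passing theme names through `html.escape` turns the ampersand in "Sex & Nudity" into a valid HTML entity.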

In [None]:
import folium
from folium.plugins import MarkerCluster
from geopy.geocoders import Nominatim

# Initialize geolocator
geolocator = Nominatim(user_agent="geo_viz", timeout=10)

# Initialize the map
world_map = folium.Map(location=[20, 0], zoom_start=2, tiles='CartoDB positron')
marker_cluster = MarkerCluster().add_to(world_map)

# Function to assign a color based on themes
def get_theme_color(theme):
    theme_colors = {
        "Family": "#0074D9",       # Blue
        "Love": "#FF69B4",         # Pink
        "War": "#FF6347",          # Red
        "Politics": "#8A2BE2",     # Purple
        "Crime": "#FFA500",        # Orange
        "Nature": "#32CD32",       # Lime Green
        "Religious": "#9400D3",    # Dark Violet
        "Tourism": "#228B22",      # Forest Green
        "Friendship": "#ADD8E6",   # Light Blue
        "History": "#20B2AA",      # Light Sea Green
        "Science": "#FFD700",      # Gold
        "Fantasy": "#4B0082",      # Indigo
        "Recovery": "#DAA520",     # Goldenrod
        "Sex & Nudity": "#FF1493"  # Deep Pink for Sex & Nudity
    }
    return theme_colors.get(theme, "#808080")  # Default color if theme is not found

us_cities = [
    "New York", "Washington", "Los Angeles", "Kansas City", "Chicago", "New Orleans", "San Francisco", "Miami", "Houston",
    "Seattle", "Cleveland", "San Diego", "Las Vegas", "Philadelphia", "Dallas", "Atlanta", "Minneapolis", "Tulsa", "Bismarck",
    "St. Louis", "Baltimore", "Mesa", "Memphis", "San Antonio", "Charlotte", "San Jose", "Jacksonville", "Portland", "Phoenix",
    "Tampa", "Salem"
]

uk_cities = [
    "London", "Manchester", "Bristol", "Glasgow", "Edinburgh", "Birmingham", "Leeds", "Liverpool",
    "Cambridge", "Oxford", "Cardiff", "Sheffield", "York", "Leicester", "Newcastle", "Nottingham",
    "Southampton", "Exeter", "Coventry", "Brighton", "Reading", "Derby", "Stoke-on-Trent",
    "Sunderland", "Loughborough", "Worcester", "Luton", "Basingstoke", "Milton Keynes"
]

indian_cities = [
    "Mumbai", "New Delhi", "Bangalore", "Hyderabad", "Chennai", "Kolkata", "Pune", "Ahmedabad",
    "Jaipur", "Lucknow", "Kanpur", "Nagpur", "Indore", "Patna", "Vadodara", "Surat", "Chandigarh",
    "Bhopal", "Vijayawada", "Kochi", "Coimbatore", "Visakhapatnam", "Madurai", "Ranchi", "Agra",
    "Faridabad", "Noida", "Ghaziabad", "Gurugram", "Meerut", "Jammu", "Raipur", "Shimla", "Tirunelveli",
    "Jabalpur", "Mangalore", "Dibrugarh", "Udaipur", "Gwalior", "Puducherry", "Dehradun"
]

# Filter the DataFrame for US, UK, and India only
countries_to_include = ['United States', 'United Kingdom', 'India']
filtered_df = results_df[results_df['Country'].isin(countries_to_include)]

# Group the filtered rows by theme ('results_df' is expected to have 'Country', 'City', 'Theme', and 'Keyword' columns)
grouped_data = filtered_df.groupby('Theme')

# Allowed cities for each country
allowed_cities = {
    'United States': set(us_cities),
    'United Kingdom': set(uk_cities),
    'India': set(indian_cities),
}

# Cache geocoding results so that each (city, country) pair is requested only once
geocode_cache = {}

# Create a dictionary to hold the top keywords for each theme
top_keywords_per_theme = {}

# Iterate through each theme group
for theme, group in grouped_data:
    # Find the top 3 keywords for this theme based on frequency
    top_keywords = group['Keyword'].value_counts().head(3).index.tolist()
    top_keywords_per_theme[theme] = top_keywords

    for _, row in group.iterrows():
        city = row['City']
        country = row['Country']
        theme_word = row['Keyword']

        # Skip cities that are not on the list for their country
        if city not in allowed_cities.get(country, ()):
            continue

        try:
            # Include the country in the query to disambiguate names such as Birmingham or Cambridge
            if (city, country) not in geocode_cache:
                geocode_cache[(city, country)] = geolocator.geocode(f"{city}, {country}")
            location = geocode_cache[(city, country)]

            if location:
                # Prepare popup content
                top_keywords_text = "Top Keywords: " + ", ".join(map(str, top_keywords))
                popup_content = f"<b>{theme}</b><br>{top_keywords_text}<br>Location: {city}"

                # Add marker to cluster with theme color
                folium.CircleMarker(
                    location=[location.latitude, location.longitude],
                    radius=8,
                    color=get_theme_color(theme),
                    fill=True,
                    fill_color=get_theme_color(theme),
                    fill_opacity=0.8,
                    tooltip=folium.Tooltip(f"{theme}: {theme_word}"),
                    popup=folium.Popup(popup_content, max_width=300),
                    weight=2
                ).add_to(marker_cluster)
        except Exception as e:
            print(f"Error processing city {city}: {e}")

# Add a fixed-position legend that explains the theme colors
legend_html = """
<div style="position: fixed;
            bottom: 50px; left: 50px; width: 250px; height: auto;
            background-color: white; border: 2px solid grey; z-index: 9999; font-size: 14px;
            padding: 10px; border-radius: 10px; box-shadow: 2px 2px 5px rgba(0, 0, 0, 0.3);">
    <b>Theme Legend:</b><br>
    <div><i style="background: #0074D9; padding: 5px; color: white; border-radius: 5px;">&nbsp;&nbsp;&nbsp;</i> Family</div>
    <div><i style="background: #FF69B4; padding: 5px; color: white; border-radius: 5px;">&nbsp;&nbsp;&nbsp;</i> Love</div>
    <div><i style="background: #FF6347; padding: 5px; color: white; border-radius: 5px;">&nbsp;&nbsp;&nbsp;</i> War</div>
    <div><i style="background: #8A2BE2; padding: 5px; color: white; border-radius: 5px;">&nbsp;&nbsp;&nbsp;</i> Politics</div>
    <div><i style="background: #FFA500; padding: 5px; color: white; border-radius: 5px;">&nbsp;&nbsp;&nbsp;</i> Crime</div>
    <div><i style="background: #32CD32; padding: 5px; color: white; border-radius: 5px;">&nbsp;&nbsp;&nbsp;</i> Nature</div>
    <div><i style="background: #9400D3; padding: 5px; color: white; border-radius: 5px;">&nbsp;&nbsp;&nbsp;</i> Religious</div>
    <div><i style="background: #228B22; padding: 5px; color: white; border-radius: 5px;">&nbsp;&nbsp;&nbsp;</i> Tourism</div>
    <div><i style="background: #ADD8E6; padding: 5px; color: white; border-radius: 5px;">&nbsp;&nbsp;&nbsp;</i> Friendship</div>
    <div><i style="background: #20B2AA; padding: 5px; color: white; border-radius: 5px;">&nbsp;&nbsp;&nbsp;</i> History</div>
    <div><i style="background: #FFD700; padding: 5px; color: black; border-radius: 5px;">&nbsp;&nbsp;&nbsp;</i> Science</div>
    <div><i style="background: #4B0082; padding: 5px; color: white; border-radius: 5px;">&nbsp;&nbsp;&nbsp;</i> Fantasy</div>
    <div><i style="background: #DAA520; padding: 5px; color: white; border-radius: 5px;">&nbsp;&nbsp;&nbsp;</i> Recovery</div>
    <div><i style="background: #FF1493; padding: 5px; color: white; border-radius: 5px;">&nbsp;&nbsp;&nbsp;</i> Sex &amp; Nudity</div>
</div>
"""
world_map.get_root().html.add_child(folium.Element(legend_html))

# Save the map to an HTML file
world_map.save("themed_world_map.html")
