# Yachay.ai

## Purpose

Yachay is an open-source machine learning community with decades' worth of natural language data from media, the dark web, legal proceedings, and government publications. The community has cleaned and annotated the data and built a geolocation detection tool, and it is looking for developers interested in contributing to and improving the project. We are given a dataset of tweets and a dataset of cluster coordinates, from which we will build a neural network that predicts coordinates from text.

## Introduction

In [None]:
# imports used throughout the notebook
import math
from math import radians

import numpy as np
import pandas as pd
import plotly.express as px
import tensorflow as tf
import torch
import transformers
from sklearn.compose import make_column_transformer
from sklearn.metrics.pairwise import haversine_distances
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from tensorflow.keras.layers import Dense
from tensorflow.keras.models import Sequential
from tensorflow.keras.optimizers import Adam
from tqdm.auto import tqdm

### Main Dataset

In [None]:
# read dataset
df_main = pd.read_csv('data/Main_Dataset.csv', parse_dates=['timestamp'], index_col=['timestamp'])

In [None]:
# sort by timestamp
df_main.sort_index(inplace=True)

In [None]:
# look at dataset
df_main.head()

In [None]:
# confirm the index is sorted (is_monotonic was removed in pandas 2.0)
df_main.index.is_monotonic_increasing

In [None]:
# look at column information
df_main.info()

In [None]:
# looking for missing values
df_main.isna().sum()

In [None]:
# looking for duplicates
df_main.duplicated().sum()

In [None]:
# data with missing index
df_main.index.isna().sum()

In [None]:
# percentage of data with missing index
df_main.index.isna().sum() / len(df_main) * 100

In [None]:
# looking at rows with a missing index
df_main[df_main.index.isna()]

Overall, the main dataset is fairly clean. We loaded the data as a time series and parsed the dates. This dataframe contains most of the features we need to train our model. The only missing values are timestamps; the other columns in those rows are present. Because the affected rows make up about 2% of the dataset and we cannot impute the timestamps, we will drop them.
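
As a minimal sketch of that step, the toy frame below (hypothetical values, not the project data) has one missing timestamp in its index. It measures the share of missing timestamps and keeps only the rows whose index is valid.

```python
import pandas as pd

# toy stand-in for df_main: a timestamp index with one missing value (NaT)
toy = pd.DataFrame(
    {"text": ["first tweet", "second tweet", "third tweet"],
     "cluster_id": [1, 2, 3]},
    index=pd.to_datetime(["2021-01-01 10:00", None, "2021-01-02 09:30"]),
)
toy.index.name = "timestamp"

# share of rows whose timestamp is missing, as a percentage
missing_pct = toy.index.isna().mean() * 100

# keep only rows whose index holds a valid timestamp
toy_clean = toy[toy.index.notna()]

print(f"{missing_pct:.1f}% missing, {len(toy_clean)} rows kept")
```

Filtering on the index directly makes the intent explicit; the notebook reaches the same result later by calling `dropna` once the timestamp-derived columns exist.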

### Cluster Coordinates

In [None]:
# load cluster data
df_cl = pd.read_csv('data/Clusters_Coordinates.csv')

In [None]:
# look at dataset
df_cl.head()

In [None]:
# looking at column info
df_cl.info()

In [None]:
# looking for missing values
df_cl.isna().sum()

The cluster coordinates dataframe contains the cluster ID along with the latitude and longitude. It is clean, with no missing values. We will merge the two dataframes before conducting EDA.

## Feature Engineering

In [None]:
# visual of data before feature engineering
df_main.head()

In [None]:
# Making timestamp features
def make_features(data):
    data['year'] = data.index.year
    data['month'] = data.index.month
    data['week'] = data.index.isocalendar().week
    data['day'] = data.index.day
    data['day_of_week'] = data.index.day_of_week 
    data['day_of_year'] = data.index.day_of_year
    data['hour'] = data.index.hour 
    data['minute'] = data.index.minute 
    data['second'] = data.index.second
    
make_features(df_main)

In [None]:
# new features added
df_main.head()

In [None]:
# merge main and cluster coordinates
df = df_main.merge(df_cl, on='cluster_id', sort=True)

In [None]:
# new merged dataset
df.head()

In [None]:
# drop missing values
df.dropna(inplace=True)

In [None]:
# missing values
df.isna().sum()

In [None]:
# shape of dataset
df.shape

In [None]:
# df.to_csv('processed data/df.csv', index=False)

We merged the datasets on cluster id. We then dropped all rows with the missing timestamp data. We are left with a total of close to 600,000 rows of data. 
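
To make the merge step concrete, here is a minimal sketch on two toy frames (all values are made up). It assumes each cluster ID appears once in the coordinates table, which `validate="many_to_one"` checks.

```python
import pandas as pd

# toy stand-ins for the tweet table and the cluster coordinate table
tweets = pd.DataFrame({"text": ["a", "b", "c"], "cluster_id": [10, 20, 10]})
clusters = pd.DataFrame(
    {"cluster_id": [10, 20], "lat": [34.0, 40.7], "lng": [-118.2, -74.0]}
)

# many tweets map to one cluster; validate= raises if a cluster id repeats on the right
merged = tweets.merge(clusters, on="cluster_id", how="inner", validate="many_to_one")

print(merged)
```

When merging on a column, pandas ignores the original indexes and returns a fresh integer index, so the timestamp index of `df_main` is not carried into `df`; only the timestamp-derived columns are.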

## EDA

In [None]:
# summary statistics
df.describe()

In [None]:
# number of unique users
df.user_id.nunique()

In [None]:
# number of unique clusters
df.cluster_id.nunique()

In [None]:
# number of unique latitudes
df.lat.nunique()

In [None]:
# number of unique longitudes
df.lng.nunique()

In [None]:
# skew of data
df.skew(numeric_only=True)

In [None]:
# correlation of data
px.imshow(df.corr(numeric_only=True), text_auto=True, aspect='auto')

In [None]:
# distributions of columns
columns = ['month', 'week', 'day', 'day_of_week', 'day_of_year', 'hour', 'minute', 'second']
for column in columns:
    px.histogram(df[column], title='Distribution of '+ str.upper(column).replace('_', ' '), labels={'value': str(column).replace('_', ' ')}).show()

In [None]:
px.histogram(df.user_id.value_counts(), title='Distribution of User Id\'s')

In [None]:
px.histogram(df.cluster_id.value_counts(), title='Distribution of Cluster Id\'s')

The data is concentrated in two parts of the year: one near the beginning and a larger one near the end. The same pattern appears in the month and week-of-year distributions. By day of the month, the data is uniformly distributed until the last week, when it declines to roughly half the monthly mean. Tweets are spread evenly across the days of the week, with no visible difference between weekdays and the weekend. The hour-of-day distribution is mostly uniform, apart from a gradual dip in the middle of the day followed by a gradual recovery. Minutes and seconds are uniformly distributed.

The tweet counts per user ID and per cluster ID are right-skewed.

### Distribution of Coordinates

> The maps are in the maps notebook.

The data is concentrated in North America, with most coordinates in the US and Mexico. Most tweets come from major cities on the east and west coasts, with clusters in New York, California, and Florida. A few tweets originate from Alaska, Canada, Hawaii, and the Caribbean.

The heatmap further illustrates the concentration of tweets on the east coast of the US.

## NLP

The NLP tools we use are from Hugging Face. We will test several models to determine which works best with our dataset. We narrowed the options to BERT base, BERT multilingual, and XLM-RoBERTa. Preprocessing is similar across the three; BERT base is trained on English text, while the multilingual BERT and XLM-RoBERTa models cover English and many other languages. We anticipate that the multilingual models will perform better than the base model, as our text data contains many different languages.

In [None]:
# looking through tweets
tweets = df.text.tolist()
tweets[10:21]

### BERT Base

In [None]:
# function to preprocess data for modelling 
def preprocess_base(df, max_sample, batch_size=200):
    max_sample_size = max_sample # set the max sample size

    # preprocessing and BERT
    tokenizer = transformers.BertTokenizer.from_pretrained('bert-base-uncased')

    ids_list_df = []
    attention_mask_list_df = []

    max_length = 512

    for input_text in df.iloc[:max_sample_size]['text']:
        ids = tokenizer.encode(input_text.lower(), add_special_tokens=True, truncation=True, max_length=max_length)
        padded = np.array(ids + [0]*(max_length - len(ids)))
        attention_mask = np.where(padded != 0, 1, 0)
        ids_list_df.append(padded)
        attention_mask_list_df.append(attention_mask)
    
    # get embeddings 
    model = transformers.BertModel.from_pretrained('bert-base-uncased')

    # batch_size is typically 100, but lower values reduce the memory requirements

    embeddings_df = []

    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')  # will use cpu unless cuda is available
    print(f'Using the {device} device.')
    model.to(device)

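    # ceiling division so the final partial batch is embedded too (keeps X and y the same length)
    n_batches = (len(ids_list_df) + batch_size - 1) // batch_size
    model.eval()  # inference mode: disables dropout

    for i in tqdm(range(n_batches)):
        
        ids_batch_df = torch.tensor(np.array(ids_list_df[batch_size*i:batch_size*(i+1)]), dtype=torch.long).to(device)
        attention_mask_batch_df = torch.tensor(np.array(attention_mask_list_df[batch_size*i:batch_size*(i+1)]), dtype=torch.long).to(device)

        with torch.no_grad():
            batch_embeddings = model(ids_batch_df, attention_mask=attention_mask_batch_df)

        # keep the embedding of the first ([CLS]) token of each sequence
        embeddings_df.append(batch_embeddings[0][:,0,:].detach().cpu().numpy())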

    X = np.concatenate(embeddings_df)  # create features
    y = df.iloc[:max_sample_size][['lat', 'lng']] # create target with matching length as features

    print(X.shape)  # illustrate matching length
    print(y.shape)   # illustrate matching length

    return X, y  # return the processed features and target dataframe


In [None]:
# processing training data 
#X_base, y_base = preprocess_base(df[591408:], len(df[591408:]), 4)

In [None]:
# np.savetxt("processed data/X_base.csv", X_base, delimiter=",")
# y_base.to_csv('processed data/y_base.csv', header=False, index=False)

### BERT Multilingual

In [None]:
# function to preprocess data for modelling 
def preprocess_multi(df, max_sample, batch_size=200):
    max_sample_size = max_sample # set the max sample size

    # preprocessing and BERT
    tokenizer = transformers.BertTokenizer.from_pretrained('bert-base-multilingual-uncased')

    ids_list_df = []
    attention_mask_list_df = []

    max_length = 512

    for input_text in df.iloc[:max_sample_size]['text']:
        ids = tokenizer.encode(input_text.lower(), add_special_tokens=True, truncation=True, max_length=max_length)
        padded = np.array(ids + [0]*(max_length - len(ids)))
        attention_mask = np.where(padded != 0, 1, 0)
        ids_list_df.append(padded)
        attention_mask_list_df.append(attention_mask)
    
    # get embeddings 
    model = transformers.BertModel.from_pretrained('bert-base-multilingual-uncased')

    # batch_size is typically 100, but lower values reduce the memory requirements

    embeddings_df = []

    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')  # will use cpu unless cuda is available
    print(f'Using the {device} device.')
    model.to(device)

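    # ceiling division so the final partial batch is embedded too (keeps X and y the same length)
    n_batches = (len(ids_list_df) + batch_size - 1) // batch_size
    model.eval()  # inference mode: disables dropout

    for i in tqdm(range(n_batches)):
        
        ids_batch_df = torch.tensor(np.array(ids_list_df[batch_size*i:batch_size*(i+1)]), dtype=torch.long).to(device)
        attention_mask_batch_df = torch.tensor(np.array(attention_mask_list_df[batch_size*i:batch_size*(i+1)]), dtype=torch.long).to(device)

        with torch.no_grad():
            batch_embeddings = model(ids_batch_df, attention_mask=attention_mask_batch_df)

        # keep the embedding of the first ([CLS]) token of each sequence
        embeddings_df.append(batch_embeddings[0][:,0,:].detach().cpu().numpy())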

    X = np.concatenate(embeddings_df)  # create features
    y = df.iloc[:max_sample_size][['lat', 'lng']] # create target with matching length as features

    print(X.shape)  # illustrate matching length
    print(y.shape)   # illustrate matching length
    
    return X, y  # return the features and target dataframes

In [None]:
# processing training data 
# X_multi, y_multi = preprocess_multi(df, 591412, 296)

In [None]:
# X_multi = pd.DataFrame(X_multi)
# X_multi.to_csv('/notebooks/X_multi.csv', index=False)

In [None]:
# np.savetxt("processed data/X_multi.csv", X_multi, delimiter=",")
# y_multi.to_csv('processed data/y_multi.csv', header=False, index=False)

### XLM 

In [None]:
# function to preprocess data for modelling 
def preprocess_xlm(df, max_sample, batch_size=200):
    max_sample_size = max_sample # set the max sample size

    # preprocessing and XLM-RoBERTa
    tokenizer = transformers.XLMRobertaTokenizerFast.from_pretrained('xlm-roberta-large')

    ids_list_df = []
    attention_mask_list_df = []

    max_length = 512

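    pad_id = tokenizer.pad_token_id  # XLM-RoBERTa pads with id 1; id 0 is the <s> start token

    for input_text in df.iloc[:max_sample_size]['text']:
        ids = tokenizer.encode(input_text.lower(), add_special_tokens=True, truncation=True, max_length=max_length)
        padded = np.array(ids + [pad_id]*(max_length - len(ids)))
        # build the mask from the sequence length so that real tokens are never masked
        attention_mask = np.array([1]*len(ids) + [0]*(max_length - len(ids)))
        ids_list_df.append(padded)
        attention_mask_list_df.append(attention_mask)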
    
    # get embeddings 
    model = transformers.XLMRobertaModel.from_pretrained('xlm-roberta-large')

    # batch_size is typically 100, but lower values reduce the memory requirements

    embeddings_df = []

    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')  # will use cpu unless cuda is available
    print(f'Using the {device} device.')
    model.to(device)

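    # ceiling division so the final partial batch is embedded too (keeps X and y the same length)
    n_batches = (len(ids_list_df) + batch_size - 1) // batch_size
    model.eval()  # inference mode: disables dropout

    for i in tqdm(range(n_batches)):
        
        ids_batch_df = torch.tensor(np.array(ids_list_df[batch_size*i:batch_size*(i+1)]), dtype=torch.long).to(device)
        attention_mask_batch_df = torch.tensor(np.array(attention_mask_list_df[batch_size*i:batch_size*(i+1)]), dtype=torch.long).to(device)

        with torch.no_grad():
            batch_embeddings = model(ids_batch_df, attention_mask=attention_mask_batch_df)

        # keep the embedding of the first (<s>) token of each sequence
        embeddings_df.append(batch_embeddings[0][:,0,:].detach().cpu().numpy())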

    X = np.concatenate(embeddings_df)  # create features
    y = df.iloc[:max_sample_size][['lat', 'lng']] # create target with matching length as features

    print(X.shape)  # illustrate matching length
    print(y.shape)   # illustrate matching length
    
    return X, y  # return the features and target dataframes

We preprocessed the data into three different sets of training and test data. We will run our models on each of the three sets to determine which NLP model works best for our data.
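
Padding and masking follow the same pattern for all three models, but the padding id differs: BERT vocabularies use 0 for `[PAD]`, while XLM-RoBERTa uses 1 for `<pad>` and reserves 0 for `<s>`. The sketch below uses a hypothetical helper, `pad_and_mask`, that builds the mask from the sequence length, so it does not depend on which id the tokenizer pads with. The token ids are made up for illustration.

```python
import numpy as np

def pad_and_mask(ids, max_length, pad_id):
    """Right-pad a list of token ids and build the matching attention mask."""
    ids = ids[:max_length]                         # truncate overly long sequences
    n_pad = max_length - len(ids)
    padded = np.array(ids + [pad_id] * n_pad)
    mask = np.array([1] * len(ids) + [0] * n_pad)  # 1 = real token, 0 = padding
    return padded, mask

# made-up token ids; id 0 appears here as a real token, as it does for XLM-RoBERTa
padded, mask = pad_and_mask([0, 581, 2], max_length=6, pad_id=1)
print(padded)
print(mask)
```

In practice the padding id can be read from `tokenizer.pad_token_id` rather than hard-coded.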

## Sample of larger dataset

With limited resources, we cannot work with the entire dataset of roughly half a million rows. Our approach is to take a manageable sample of the data and save it, which reduces computation time.

> Note: take a random sample of the dataframe before tokenization so that the sample better represents the distribution of coordinates. The coordinates in the current non-random sample are in Los Angeles, CA.
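
A minimal sketch of the difference, on a toy frame with made-up values: because the merge sorts by `cluster_id`, slicing the first rows tends to return only a few clusters, whereas `DataFrame.sample` draws from all rows. The seed of 19 mirrors the one used elsewhere in this notebook; the sample size is arbitrary.

```python
import pandas as pd

# toy stand-in for the merged dataframe, sorted so that one cluster comes first
toy = pd.DataFrame({
    "text": [f"tweet {i}" for i in range(10)],
    "cluster_id": [1] * 5 + [2] * 5,
})

head_sample = toy.iloc[:4]                      # non-random: only cluster 1
rand_sample = toy.sample(n=4, random_state=19)  # random: drawn from all rows

print(head_sample.cluster_id.unique(), rand_sample.cluster_id.unique())
```

Sampling before tokenization also means only the sampled rows need to be embedded.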

In [None]:
#np.savetxt("processed data/X_xlm_1000.csv", X_xlm, delimiter=",")
#y_xlm.to_csv('processed data/y_xlm_1000.csv', header=False, index=False)

In [None]:
df = pd.read_csv('df.csv')

## Loading Processed Data

### Base Model

In [None]:
X_base = pd.read_csv('inputs/X_base.csv')

In [None]:
X_base.shape

### Multilingual Model

In [None]:
X_multi = pd.read_csv('inputs/X_multi.csv')

In [None]:
X_multi.shape

### XLM Model

In [None]:
X_xlm = pd.read_csv('inputs/X_xlm.csv')

In [None]:
X_xlm.shape

### Target Dataframe

In [None]:
y = pd.read_csv('inputs/y.csv')

In [None]:
y.head()

## Neural Network

In [None]:
# train test split
X = X_base  # assumption: embedding set under test; swap in X_multi or X_xlm to compare models
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=19) # hold out 20% of the data as the test set

In [None]:
# haversine distance loss (expects columns ordered as [lat, lng], in degrees)
RADIUS_KM = 6371.0  # mean Earth radius, matching the evaluation in the Haversine distance section

def degrees_to_radians(deg):
    pi_on_180 = 0.017453292519943295
    return deg * pi_on_180

def loss_haversine(observation, prediction):
    # cast so that float64 targets do not clash with float32 predictions
    observation = tf.cast(observation, prediction.dtype)
    obv_rad = degrees_to_radians(observation)
    prev_rad = degrees_to_radians(prediction)

    # sin^2(delta / 2) per column: [:, 0] is latitude, [:, 1] is longitude
    v = tf.sin((obv_rad - prev_rad) / 2) ** 2

    # haversine term: sin^2(dlat/2) + cos(lat1) * cos(lat2) * sin^2(dlng/2)
    a = v[:, 0] + tf.cos(obv_rad[:, 0]) * tf.cos(prev_rad[:, 0]) * v[:, 1]
    a = tf.clip_by_value(a, 0.0, 1.0)  # keep asin within its domain

    # great-circle distance in km for each sample
    c = 2 * tf.math.asin(tf.sqrt(a)) * RADIUS_KM

    # mean haversine distance in km over the batch
    return tf.reduce_mean(c)
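
For reference, the loss above is based on the standard haversine formula. In the notation assumed here, $\varphi$ is latitude and $\lambda$ is longitude (both in radians), subscripts 1 and 2 denote the observed and predicted points, and $R$ is the Earth radius in kilometres:

$$
a = \sin^2\!\left(\frac{\varphi_2 - \varphi_1}{2}\right) + \cos\varphi_1 \,\cos\varphi_2 \,\sin^2\!\left(\frac{\lambda_2 - \lambda_1}{2}\right),
\qquad
d = 2R \arcsin\!\left(\sqrt{a}\right)
$$

The loss is the mean of $d$ over the batch, so it can be read directly as an average error in kilometres. Note that the cosine terms take latitudes, so the column order of the targets matters.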

In [None]:
tf.random.set_seed(19)
optimizer = Adam(learning_rate=.0001)
callback = tf.keras.callbacks.EarlyStopping(monitor='loss', patience=5)
reduce_lr = tf.keras.callbacks.ReduceLROnPlateau(monitor='val_loss', factor=0.1,
                              patience=5, min_lr=0.0000001)

# define the model architecture
model = Sequential()
model.add(Dense(4000, activation='relu', input_dim=(X_train.shape[1])))
model.add(Dense(2000, activation='relu'))
model.add(Dense(2)) # output layer with 2 units for latitude and longitude

# compile the model
model.compile(optimizer=optimizer, loss=loss_haversine, metrics=['mse'])

# train the model
with tf.device('/GPU:0'):
    history = model.fit(X_train, y_train, epochs=1, batch_size=32, validation_split=0.10, callbacks=[callback, reduce_lr])

In [None]:
tf.random.set_seed(19)
optimizer = Adam(learning_rate=.0001)
callback = tf.keras.callbacks.EarlyStopping(monitor='loss', patience=5)
reduce_lr = tf.keras.callbacks.ReduceLROnPlateau(monitor='val_loss', factor=0.1,
                              patience=5, min_lr=0.0000001)

# define the model architecture
model = Sequential()
model.add(Dense(8000, activation='relu', input_dim=(X_train.shape[1])))
model.add(Dense(4000, activation='relu'))
model.add(Dense(2000, activation='relu'))
model.add(Dense(1000, activation='relu'))
model.add(Dense(500, activation='relu'))
model.add(Dense(2)) # output layer with 2 units for latitude and longitude

# compile the model
model.compile(optimizer=optimizer, loss=loss_haversine, metrics=['mse'])

# train the model
with tf.device('/GPU:0'):
    history = model.fit(X_train, y_train, epochs=5, batch_size=32, validation_split=0.10, callbacks=[callback, reduce_lr])


### Convolutional Neural Network

In [None]:
from tensorflow.keras.layers import Conv1D, Flatten, Dense, MaxPooling1D, Reshape

In [None]:
model = Sequential()
model.add(Reshape((X_train.shape[1], 1), input_shape=(X_train.shape[1],)))
model.add(Conv1D(filters=32, kernel_size=3, activation='relu'))
model.add(MaxPooling1D(pool_size=2))
model.add(Conv1D(filters=64, kernel_size=3, activation='relu'))
model.add(MaxPooling1D(pool_size=2))
model.add(Flatten())
model.add(Dense(128, activation='relu'))
model.add(Dense(2))
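# use a fresh optimizer: an instance already used to train another model should not be reused
model.compile(optimizer=Adam(learning_rate=.0001), loss=loss_haversine, metrics=['mse'])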
history = model.fit(X_train, y_train, epochs=1, batch_size=32, validation_split=0.10, callbacks=[callback, reduce_lr])


### LSTM

In [None]:
from tensorflow.keras.layers import LSTM, Dropout

In [None]:
tf.random.set_seed(19)
optimizer = Adam(learning_rate=.0001)
callback = tf.keras.callbacks.EarlyStopping(monitor='loss', patience=5)
reduce_lr = tf.keras.callbacks.ReduceLROnPlateau(monitor='val_loss', factor=0.1,
                              patience=5, min_lr=0.0000001)

# define the model architecture
model = Sequential()
model.add(LSTM(10, input_shape=(X_train.shape[1], 1), return_sequences=True))
model.add(Dropout(0.2))
model.add(LSTM(5, return_sequences=False))
model.add(Dropout(0.2))
model.add(Dense(2)) # output layer with 2 units for latitude and longitude

# compile the model
model.compile(optimizer=optimizer, loss=loss_haversine, metrics=['mse'])

# train the model
with tf.device('/GPU:0'):
    history = model.fit(X_train.values.reshape((X_train.shape[0], X_train.shape[1], 1)), y_train, epochs=1, batch_size=32, validation_split=0.10, callbacks=[callback, reduce_lr])


In [None]:
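model = Sequential()
model.add(LSTM(units=128, input_shape=(X_train.shape[1], 1)))
model.add(Dense(2))
# use a fresh optimizer rather than the instance already used by the previous model
model.compile(optimizer=Adam(learning_rate=.0001), loss=loss_haversine, metrics=['mse'])
# LSTM expects 3-D input (samples, timesteps, features), so add a trailing axis
history = model.fit(X_train.values.reshape((X_train.shape[0], X_train.shape[1], 1)), y_train, epochs=5, batch_size=32, validation_split=0.10, callbacks=[callback, reduce_lr])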


### GRU

In [None]:
import tensorflow as tf
from tensorflow.keras.layers import Input, GRU, Dense, Reshape
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam

# set random seed for reproducibility
tf.random.set_seed(19)

# define the inputs
inputs = Input(shape=(X_train.shape[1],))

# add a trailing feature axis: GRU expects (batch, timesteps, features)
sequence = Reshape((X_train.shape[1], 1))(inputs)

# GRU layer with 64 units and default activation function (tanh)
gru = GRU(64)(sequence)

# dense layers with relu activation function
dense1 = Dense(128, activation='relu')(gru)
dense2 = Dense(64, activation='relu')(dense1)

# output layer with 2 units for latitude and longitude
outputs = Dense(2)(dense2)

# define the model
model = Model(inputs=inputs, outputs=outputs)

# compile the model
optimizer = Adam(learning_rate=0.0001)
model.compile(optimizer=optimizer, loss=loss_haversine, metrics=['mse'])

# train the model
history = model.fit(X_train, y_train, epochs=5, batch_size=32, validation_split=0.1, callbacks=[callback, reduce_lr])


In [None]:
# evaluation on test set
model.evaluate(X_test, y_test)

In [None]:
# X test predictions
preds = model.predict(X_test)


In [None]:
y_test.value_counts()

In [None]:
print(history.history.keys())

In [None]:
import plotly.express as px
import pandas as pd

# Convert the model history to a pandas DataFrame
df_his = pd.DataFrame(history.history)

# Create separate figures for loss, MSE, and learning rate
fig_loss = px.line(df_his, x=df_his.index, y=['loss', 'val_loss'], labels={'value': 'Loss', 'index': 'Epoch'}, title='Model Loss')
fig_acc = px.line(df_his, x=df_his.index, y=['mse', 'val_mse'], labels={'value': 'MSE', 'index': 'Epoch'}, title='Model MSE')
fig_lr = px.line(df_his, x=df_his.index, y='lr', labels={'value': 'Learning Rate', 'index': 'Epoch'}, title='Model Learning Rate', log_y=True)

# Show the figures
fig_loss.show()
fig_acc.show()
fig_lr.show()

In [None]:
y_test.head()

In [None]:
preds_df = pd.DataFrame(preds, columns=['lat_p', 'lng_p'])

In [None]:
y_df = y_test.reset_index(drop=True)

In [None]:
coords = pd.concat([y_df, preds_df], axis=1)

In [None]:
coords

## Haversine distance

In [None]:
# sanity check on two nearby points in Southern California ([lat, lng] in degrees)
point_a = [34.020789, -118.411907]
point_b = [34.087627, -118.664711]
point_a_in_radians = [radians(_) for _ in point_a]
point_b_in_radians = [radians(_) for _ in point_b]
result = haversine_distances([point_a_in_radians, point_b_in_radians])
result * 6371000/1000  # multiply by Earth radius to get kilometers


In [None]:
# convert test set coordinates to radians   
y_test_rad = y_test * (math.pi/180)

In [None]:
# convert prediction coordinates to radians
preds_rad = preds * (math.pi/180)

In [None]:
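# calculate the distance between each actual point and its own prediction
# haversine_distances returns the full pairwise matrix, so take its diagonal
distances = np.diagonal(haversine_distances(y_test_rad, preds_rad))
distances_km = distances * (6371000/1000)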

In [None]:
px.bar(distances_km, title='Distances Between Actual and Prediction', labels={'value': 'Distance (Km)'}, template='presentation')

In [None]:
px.box(distances_km, title='Distribution of Distances', labels={'value': 'Distance (Km)'}, template='plotly_white')
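
`haversine_distances` computes every pairwise distance, which grows quadratically with the size of the test set. When only the distance between each actual point and its own prediction is needed, it can be computed row by row instead. The helper below, `paired_haversine_km`, is a hypothetical sketch that assumes both inputs are `(n, 2)` arrays of `[lat, lng]` in degrees and uses the same 6371 km radius as above.

```python
import numpy as np

def paired_haversine_km(true_deg, pred_deg, radius_km=6371.0):
    """Row-wise great-circle distance in km between two (n, 2) arrays of [lat, lng] in degrees."""
    true_rad = np.radians(np.asarray(true_deg, dtype=float))
    pred_rad = np.radians(np.asarray(pred_deg, dtype=float))
    dlat = pred_rad[:, 0] - true_rad[:, 0]
    dlng = pred_rad[:, 1] - true_rad[:, 1]
    a = (np.sin(dlat / 2) ** 2
         + np.cos(true_rad[:, 0]) * np.cos(pred_rad[:, 0]) * np.sin(dlng / 2) ** 2)
    # clip guards against rounding pushing a slightly outside [0, 1]
    return 2 * radius_km * np.arcsin(np.sqrt(np.clip(a, 0.0, 1.0)))

# the two example points used earlier in this section
d = paired_haversine_km([[34.020789, -118.411907]], [[34.087627, -118.664711]])
print(d)
```

With the notebook's variables, this would be called as `paired_haversine_km(y_test.values, preds)`, assuming `y_test` keeps its `lat`, `lng` column order.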

## Sentiment Analysis

In [None]:
#sentiment analysis
# from transformers import pipeline
# sent = pipeline('sentiment-analysis', model='cardiffnlp/twitter-xlm-roberta-base-sentiment')(df.text.values.tolist())
# sent = pd.DataFrame(sent, columns=['label', 'score'])

In [None]:
# load sentiment dataframe
sent = pd.read_csv('inputs/sent.csv')

In [None]:
# rename columns 
sent.rename(columns={'label': 'sent', 'score': 'sent_score'}, inplace=True)

In [None]:
# look at sentiment dataframe
sent.head()

In [None]:
# sentiment labels
sent.sent.values.tolist()

In [None]:
# sentiment values
sent.sent.value_counts()

In [None]:
px.histogram(sent.sent, color=sent.sent,  title='Tweet Sentiment')

In [None]:
# one hot encoder column transformer
sent_encoder = make_column_transformer((OneHotEncoder(), ['sent']))

In [None]:
# one hot encode sentiment
sentiment_array = sent_encoder.fit_transform(sent)

In [None]:
# sentiment array
sentiment_array.shape

In [None]:
pd.DataFrame(sentiment_array).head()

In [None]:
sent_dummies = pd.get_dummies(sent.sent)
sent_dummies

## Language Detection

In [None]:
# Load the tokenizer and model
# tokenizer = AutoTokenizer.from_pretrained('ivanlau/language-detection-fine-tuned-on-xlm-roberta-base')
# model = AutoModelForSequenceClassification.from_pretrained('ivanlau/language-detection-fine-tuned-on-xlm-roberta-base')

# Set up the pipeline
# classifier = pipeline('text-classification', model=model, tokenizer=tokenizer, device_map='auto')

# Example usage
# result = []
# for text in tqdm(df.text.values.tolist()):
    # result.append(classifier(text))
# df_lan = pd.DataFrame(result, columns=['label', 'score'])


In [None]:
# load languages dataframe
lan = pd.read_csv('inputs/lan.csv')

In [None]:
lan.shape

In [None]:
# rename columns 
lan.rename(columns={'label': 'language', 'score': 'lang_score'}, inplace=True)

In [None]:
lan.head()

In [None]:
# counts of the different languages
lan_counts = lan.language.value_counts()

In [None]:
# visual of different language counts
px.bar(lan_counts, color=lan_counts.index, title='Tweet Languages')

In [None]:
# one hot encoder column transformer (sparse_threshold=0 forces a dense array so it can be shown as a DataFrame)
lan_encoder = make_column_transformer((OneHotEncoder(), ['language']), sparse_threshold=0)

In [None]:
# one hot encode languages
lan_array = lan_encoder.fit_transform(lan)

In [None]:
lan_array.shape

In [None]:
pd.DataFrame(lan_array).head()

In [None]:
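# ordinal encoder column transformer (LabelEncoder only accepts a 1-D target, so it cannot be used here)
from sklearn.preprocessing import OrdinalEncoder
lan_encoder2 = make_column_transformer((OrdinalEncoder(), ['language']))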

In [None]:
lang_dummies = pd.get_dummies(lan.language)
lang_dummies

In [None]:
encoder = LabelEncoder()
lang_labels = encoder.fit_transform(lan.language)  # LabelEncoder expects a 1-D input

In [None]:
lang_labels.shape
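
The cells above produce two alternative encodings of the same column. As a minimal sketch on made-up labels, one-hot encoding yields one indicator column per language, while integer coding yields a single column of codes. The pandas category codes used here are numbered in sorted order, which is also how `LabelEncoder` assigns its labels.

```python
import pandas as pd

# made-up language labels for illustration
langs = pd.Series(["en", "es", "en", "pt"], name="language")

# one-hot: one indicator column per language
one_hot = pd.get_dummies(langs)

# integer codes: a single column, categories numbered in sorted order
codes = langs.astype("category").cat.codes

print(one_hot.shape, codes.tolist())
```

Integer codes are compact but imply an ordering between languages that has no meaning, so one-hot columns are usually the safer input for a dense network, at the cost of one feature per category.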

## Named Entity Recognition

In [None]:
# NER tokenizer
# token_classifier = pipeline(model="Abderrahim2/bert-finetuned-Location")

# tokens = token_classifier(df_rand.text.to_list())
# tokens= pd.DataFrame(tokens)


In [None]:
# load ner dataframe
ner = pd.read_csv('inputs/ner.csv')

In [None]:
# look at ner dataframe
ner.head()

In [None]:
ner_count = ner.entity.value_counts()

In [None]:
px.bar(ner_count, color=ner_count.index, title='Tweet Locations')

In [None]:
# one hot encoder column transformer
ner_encoder = make_column_transformer((OneHotEncoder(), ['entity']))

In [None]:
# one hot encode entities
ner_array = ner_encoder.fit_transform(ner)

In [None]:
ner_array.shape

In [None]:
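# ordinal encoder column transformer (LabelEncoder only accepts a 1-D target, so it cannot be used here)
from sklearn.preprocessing import OrdinalEncoder
ner_encoder2 = make_column_transformer((OrdinalEncoder(), ['entity']))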

In [None]:
ner_dummies = pd.get_dummies(ner.entity)
ner_dummies

In [None]:
encoder = LabelEncoder()
ner_labels = encoder.fit_transform(ner.entity)  # LabelEncoder expects a 1-D input

In [None]:
ner_labels.shape

## Topic Classification

In [None]:
# topic classifier
# topic_classifier = pipeline(model="jonaskoenig/topic_classification_04")

# result1 = []
# for text in tqdm(df.text.values.tolist()):
#    result1.append(topic_classifier(text))

In [None]:
topics = pd.read_csv('inputs/topics.csv')

In [None]:
topics.columns=['topic', 'score']

In [None]:
topics.head()

In [None]:
topics.topic.values.tolist()

In [None]:
topics_count= topics.topic.value_counts()

In [None]:
px.bar(topics_count, color=topics_count.index, title='Tweet Topics')

In [None]:
# one hot encoder column transformer
topics_encoder = make_column_transformer((OneHotEncoder(), ['topic']))

In [None]:
topics_array = topics_encoder.fit_transform(topics)

In [None]:
topics_array.shape

In [None]:
topics_dummies = pd.get_dummies(topics.topic)
topics_dummies

In [None]:
encoder = LabelEncoder()
topic_labels = encoder.fit_transform(topics.topic)  # LabelEncoder expects a 1-D input

In [None]:
topic_labels.shape