In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

df = pd.read_csv('../data/processed/nuforc_clean_plusfeatures.csv')
df.head()

Unnamed: 0,datetime,city,shape,duration (seconds),duration (hours/min),comments,date posted,latitude,longitude,comments_shapes,...,minute,second,date_posted,time_since_event,comment_length_words,comment_length_characters,comment_unique_words,flesch_kincaid_score,comments_colors,comments_modecolor
0,1949-10-10 20:30:00,san marcos,cylinder,2700.0,45 minutes,This event took place in early fall around 194...,4/27/2004,29.883056,-97.941111,round,...,30,0,2004-04-27,19922 days 03:30:00,24,135,22,2.9,,
1,1949-10-10 21:00:00,lackland afb,light,7200.0,1-2 hrs,1949 Lackland AFB&#44 TX. Lights racing acros...,12/16/2005,29.38421,-98.581082,cross,...,0,0,2005-12-16,20520 days 03:00:00,17,95,17,3.1,,
2,1955-10-10 17:00:00,chester (uk/england),circle,20.0,20 seconds,Green/Orange circular disc over Chester&#44 En...,1/21/2008,53.2,-2.916667,,...,0,0,2008-01-21,19095 days 07:00:00,6,51,6,8.0,,
3,1956-10-10 21:00:00,edna,circle,20.0,1/2 hour,My older brother and twin sister were leaving ...,1/17/2004,28.978333,-96.645833,other,...,0,0,2004-01-17,17264 days 03:00:00,26,138,25,4.8,,
4,1960-10-10 20:00:00,kaneohe,light,900.0,15 minutes,AS a Marine 1st Lt. flying an FJ4B fighter/att...,1/22/2004,21.418056,-157.803611,,...,0,0,2004-01-22,15808 days 04:00:00,25,154,22,8.2,,


Let's look at how this data is structured.

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 71632 entries, 0 to 71631
Data columns (total 29 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   datetime                   71632 non-null  object 
 1   city                       71632 non-null  object 
 2   shape                      71632 non-null  object 
 3   duration (seconds)         71632 non-null  float64
 4   duration (hours/min)       71632 non-null  object 
 5   comments                   71632 non-null  object 
 6   date posted                71632 non-null  object 
 7   latitude                   71632 non-null  float64
 8   longitude                  71632 non-null  float64
 9   comments_shapes            71632 non-null  object 
 10  calculated_duration        71632 non-null  object 
 11  duration_value             71632 non-null  float64
 12  duration_unit              71632 non-null  object 
 13  total_seconds              71632 non-null  flo

Every column listed above has 71632 non-null entries, so the dataset looks clean enough to start modeling. The first step I would take is to split it into a training set and a test set with the train_test_split function from scikit-learn.
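A minimal sketch of that split, assuming `shape` is the target and latitude/longitude are the features. The toy frame below is a hypothetical stand-in that only borrows the real column names; the `test_size`, `stratify`, and `random_state` values are illustrative choices, not settings taken from this notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Tiny stand-in for the NUFORC frame (made-up rows, real column names).
toy = pd.DataFrame({
    "latitude": [29.9, 29.4, 53.2, 29.0, 21.4, 40.7, 34.1, 47.6],
    "longitude": [-97.9, -98.6, -2.9, -96.6, -157.8, -74.0, -118.2, -122.3],
    "shape": ["light", "circle", "light", "circle",
              "light", "circle", "light", "circle"],
})

features = toy[["latitude", "longitude"]]
target = toy["shape"]

# stratify=target keeps the proportion of each shape similar in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    features, target, test_size=0.25, stratify=target, random_state=42
)
print(X_train.shape, X_test.shape)
```

On the real data, rare shapes with only one report would need to be dropped or merged before stratifying, since each class must appear at least twice.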

Next, I would choose a machine learning algorithm for the task. Since this is a classification problem, where we are trying to predict the shape of the UAP from the other features, I would start with a decision tree or random forest classifier.

I would then fit the model to the training data and make predictions on the test data. I would also use metrics such as accuracy, precision, and recall to evaluate the performance of the model.
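The fit, predict, and evaluate steps might look roughly like the sketch below. It runs on synthetic data rather than the NUFORC file, and it uses macro averaging for precision and recall on the assumption that the real target (`shape`) has many classes; both choices are mine, not something established earlier in this notebook.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in: two numeric features and a binary label.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Fit on the training split only, then predict the held-out split.
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Macro averaging weights every class equally, whatever its frequency.
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, average="macro", zero_division=0)
recall = recall_score(y_test, y_pred, average="macro", zero_division=0)
print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
```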

Finally, I would tune the model's hyperparameters with a grid search scored by cross-validation. I would also consider feature selection and dimensionality reduction to improve its performance further.
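As a rough illustration of the tuning step, the sketch below runs a small grid search with 3-fold cross-validation. The data is synthetic and the parameter grid is an arbitrary example, not a recommendation for this dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data: the label depends only on the first feature.
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 3))
y = (X[:, 0] > 0).astype(int)

# Every combination in the grid is scored with 3-fold cross-validation.
param_grid = {"n_estimators": [25, 50], "max_depth": [2, None]}
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=3,
    scoring="accuracy",
)
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```

After fitting, `search.best_estimator_` holds a model refit on all of `X` with the best combination, so the held-out test set should stay out of the search.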

Overall, my approach would be to fit a decision tree or random forest classifier on the training data, evaluate its predictions on the test data, and then tune its hyperparameters and refine the feature set.





In [6]:
# To model this data, we will combine several kinds of features.
# Use the report text, the report dates, the geographic locations, and the reported UAP shapes to build a model that clusters the reports, treating each report as a point in a feature space.
# The base dimensions of that space are time and location (latitude, longitude). The one-hot encoded shape and the TF-IDF vectorized text add further dimensions.
# The clusters are then the groups of reports that are similar to each other in this combined space.

import pandas as pd
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import OneHotEncoder

# import the dataset
data = pd.read_csv('../data/processed/nuforc_clean_plusfeatures.csv')
data.head(2)

Unnamed: 0,datetime,city,shape,duration (seconds),duration (hours/min),comments,date posted,latitude,longitude,comments_shapes,...,minute,second,date_posted,time_since_event,comment_length_words,comment_length_characters,comment_unique_words,flesch_kincaid_score,comments_colors,comments_modecolor
0,1949-10-10 20:30:00,san marcos,cylinder,2700.0,45 minutes,This event took place in early fall around 194...,4/27/2004,29.883056,-97.941111,round,...,30,0,2004-04-27,19922 days 03:30:00,24,135,22,2.9,,
1,1949-10-10 21:00:00,lackland afb,light,7200.0,1-2 hrs,1949 Lackland AFB&#44 TX. Lights racing acros...,12/16/2005,29.38421,-98.581082,cross,...,0,0,2005-12-16,20520 days 03:00:00,17,95,17,3.1,,


In [7]:

# one-hot encode the shape of the UAPs
enc = OneHotEncoder()
shape_encoded = enc.fit_transform(data[['shape']])

# vectorize the text data using TF-IDF
vectorizer = TfidfVectorizer()
text_vectors = vectorizer.fit_transform(data['comments'])

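# combine the encoded shape, text vectors, and location and time information
# keep the matrix sparse: .toarray() on the TF-IDF output would build a dense (n_reports x vocabulary) array
from scipy.sparse import csr_matrix, hstack
numeric = csr_matrix(data[['latitude', 'longitude', 'year', 'month', 'day']].values.astype(float))
X = hstack([shape_encoded, text_vectors, numeric]).tocsr()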

# perform k-means clustering
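# fixed seed and explicit n_init so the cluster assignments are reproducible across runs
kmeans = KMeans(n_clusters=5, n_init=10, random_state=0)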
kmeans.fit(X)
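One thing to watch in the cell above: k-means uses Euclidean distance, so raw columns such as year (values in the thousands) can outweigh TF-IDF weights and one-hot flags (values between 0 and 1). A minimal sketch of one possible fix, standardizing the numeric columns before stacking, is shown below. The toy frame is hypothetical, and StandardScaler is just one reasonable choice rather than something this notebook has settled on.

```python
import pandas as pd
from scipy.sparse import csr_matrix, hstack
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up rows that borrow the real column names.
toy = pd.DataFrame({
    "shape": ["light", "circle", "light", "circle", "disk", "light"],
    "comments": [
        "bright light moving fast",
        "orange circle over the city",
        "three lights in a row",
        "silent circle hovering",
        "metallic disk near the lake",
        "flashing light in the sky",
    ],
    "latitude": [29.9, 53.2, 29.4, 29.0, 21.4, 40.7],
    "longitude": [-97.9, -2.9, -98.6, -96.6, -157.8, -74.0],
    "year": [1949, 1955, 1999, 2004, 2010, 2013],
})

shape_encoded = OneHotEncoder().fit_transform(toy[["shape"]])
text_vectors = TfidfVectorizer().fit_transform(toy["comments"])

# Rescale numeric columns to zero mean and unit variance so they sit on
# a scale comparable to the one-hot and TF-IDF blocks.
numeric_scaled = StandardScaler().fit_transform(
    toy[["latitude", "longitude", "year"]]
)

# Stack the three blocks column-wise without densifying the text matrix.
X = hstack([shape_encoded, text_vectors, csr_matrix(numeric_scaled)]).tocsr()

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)
print(X.shape, labels)
```

Latitude and longitude are treated here as plain numbers, which ignores the curvature of the Earth; that is acceptable for a sketch but may matter for widely scattered reports.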

