# Arlington2050 Project Summary
- Mr. Jones and our class at Arlington Tech participated in the Arlington 2050 project.
- People gave their responses on what they think Arlington will be like in 2050.
- Specifically, we used the responses from the county fair.
- Because there were so many responses, our class used math and coding to turn the raw text into meaningful statistics and visualizations.

## The Process
- We used Python to work with the data.
- We used a library called pandas, which lets Python load and work with tables from sources such as Excel files and databases.

We can import pandas:

In [1]:
import pandas as pd

We can load the Excel file and assign it to a variable.

In [None]:
array1 = pd.read_excel("CountyFair.xlsx")

Now we can rename the columns of the table, making it easier to work with.

In [None]:
ds = array1.rename(columns={
    "Unnamed: 1": "Year2050",
    "Unnamed: 2": "Translation1",
    "Unnamed: 3": "Getting_Here",
    "Unnamed: 4": "Translation2",
})
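
To see what the rename step does without needing the real spreadsheet, here is a minimal sketch on a made-up two-column table. The sample responses and the `raw` and `renamed` names are assumptions for illustration only; they are not part of the county fair data.

```python
import pandas as pd

# Toy stand-in for the spreadsheet: pandas calls header-less columns "Unnamed: N".
raw = pd.DataFrame({
    "Unnamed: 1": ["More parks", "Más árboles"],
    "Unnamed: 2": [None, "More trees"],
})

# rename() returns a new table; only the columns listed in the mapping change.
renamed = raw.rename(columns={"Unnamed: 1": "Year2050", "Unnamed: 2": "Translation1"})

print(list(raw.columns))      # the original table keeps its old names
print(list(renamed.columns))  # the new table has the readable names
```

Because `rename` returns a new table instead of changing the old one, the result has to be saved in a variable, which is why the notebook assigns it to `ds`.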

Now we can begin processing the data and eventually visualize it.
- We can import spaCy, a Python library for natural language processing, along with spacytextblob for sentiment scores, wordcloud for the word cloud, and matplotlib for plotting.

In [None]:
import spacy
from wordcloud import WordCloud, STOPWORDS
import matplotlib.pyplot as plt
from spacytextblob.spacytextblob import SpacyTextBlob

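Next, we load spaCy's small English model and add the TextBlob sentiment component to it. Some responses were in Spanish, so we also used this code to replace them with their English translations: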

In [None]:
nlp = spacy.load('en_core_web_sm')
nlp.add_pipe('spacytextblob')

string_list = ds['Year2050'].tolist()
spanish_list = ds['Translation1'].tolist()
# Wherever a translation exists (is not null), overwrite the response at the same index in the main list.
for index, translation in enumerate(spanish_list):
    if pd.notna(translation):
        string_list[index] = str(translation)
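
The same replacement can probably be done without a loop. The sketch below is one possible alternative, not the code our class ran: it uses pandas' `fillna` on a made-up table (`toy` and its sample responses are assumptions) to take the translation where one exists and fall back to the original response otherwise.

```python
import pandas as pd

# Made-up responses: only the second one was written in Spanish and has a translation.
toy = pd.DataFrame({
    "Year2050": ["More parks", "Más árboles", "Better transit"],
    "Translation1": [None, "More trees", None],
})

# Start from the translations and fill the empty spots with the original responses.
english = toy["Translation1"].fillna(toy["Year2050"])
english_list = english.tolist()

print(english_list)
```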

Next, we are going to create a word cloud so we can see the most common words in the data.
- First, we join all the responses into one text and split it into individual words, each stored as its own item in a list.
- While building this list, we filter out stop words (words like "an", "and", "the", etc.) and punctuation (!, ?, etc.).

In [None]:
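text = ds['Year2050'].str.cat(sep=' ')  # join with spaces so words from neighboring responses don't run together
doc = nlp(text)
# Keep only real words: drop stop words, punctuation, and whitespace tokens.
words = [token.text for token in doc if not token.is_stop and not token.is_punct and not token.is_space]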
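
A word cloud draws a word larger the more often it appears, so it can help to look at the counts directly. The sketch below is a simplified stand-in for the spaCy filtering that needs no extra libraries: the `STOP_WORDS` set and the sample responses are assumptions, and spaCy's real stop-word list and tokenizer are much more thorough.

```python
from collections import Counter
import string

# Hypothetical mini stop-word list; spaCy's built-in list is far longer.
STOP_WORDS = {"a", "an", "and", "the", "of", "for", "in", "more", "to"}

# Made-up responses, not taken from the county fair data.
responses = [
    "More parks and trees!",
    "Affordable housing for the community",
    "More parks, more housing",
]

words = []
for response in responses:
    for raw_word in response.lower().split():
        word = raw_word.strip(string.punctuation)  # drop leading/trailing punctuation
        if word and word not in STOP_WORDS:
            words.append(word)

counts = Counter(words)  # maps each remaining word to how often it appears
print(counts.most_common(2))
```

In this toy data, "parks" and "housing" each appear twice, so they would be drawn largest in a word cloud.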

Now we can create a word cloud from the list we made, using the word cloud library we imported earlier.
> Word cloud code from Alex Elliott.

In [None]:
wordcloud = WordCloud(width=800, height=400, background_color='white', max_words=100, mask=None, contour_width=3, contour_color='steelblue').generate(" ".join(words))

plt.figure(figsize=(10, 5))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()

With this word cloud, we can see major themes from the responses, such as housing, parks, and community.

We can also create histograms to visualize this data.
- In our class, we created histograms showing the polarity and subjectivity of the responses (polarity measures how positive or negative a response is, from -1 to 1; subjectivity measures how opinion-based rather than factual it is, from 0 to 1).
- First, let's create two lists: one holding each response's polarity and one holding each response's subjectivity.

In [None]:
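pol_list = []
sub_list = []
# Start at index 2 to skip the first two rows of the sheet.
for response in string_list[2:]:
    if pd.isna(response):  # skip empty cells, which are not text
        continue
    doc = nlp(str(response))
    pol_list.append(doc._.blob.polarity)
    sub_list.append(doc._.blob.subjectivity)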
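
To get a feel for what a polarity score means, here is a toy scorer. It is not TextBlob's actual algorithm, which is more elaborate; the `TOY_LEXICON` scores, the `toy_polarity` function, and the sample responses are all made up for illustration. The idea it sketches is that words carry positive or negative scores and a response's polarity is some kind of average of them.

```python
# Hypothetical mini lexicon: each word gets a score from -1 (negative) to 1 (positive).
TOY_LEXICON = {"great": 0.8, "safe": 0.5, "crowded": -0.4, "expensive": -0.6}

def toy_polarity(text):
    """Average the scores of the words found in the lexicon; 0.0 if none are found."""
    scores = [TOY_LEXICON[w] for w in text.lower().split() if w in TOY_LEXICON]
    return sum(scores) / len(scores) if scores else 0.0

# Made-up responses: one positive, one negative, one with no scored words.
for response in ["Great parks and safe streets", "Expensive and crowded", "More bike lanes"]:
    print(response, "->", toy_polarity(response))
```

A response with no scored words comes out as 0.0, which is one reason a polarity histogram can have a tall bar at zero.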

Now we can import seaborn, the library we use to draw the histograms (pandas and matplotlib were already imported above).

In [None]:
import seaborn as sns

Now we can make the subjectivity histogram:

In [None]:
plt.figure(figsize=(10, 6))
sns.histplot(data=sub_list)
plt.title('Subjectivity of Postcard Responses from County Fair')
plt.xlabel('Subjectivity (0 = objective, 1 = subjective)')
plt.ylabel('Frequency')
plt.grid(True)
plt.show()

And the polarity histogram:

In [None]:
plt.figure(figsize=(10, 6))
sns.histplot(data=pol_list)
plt.title('Polarity of Postcard Responses from County Fair')
plt.xlabel('Polarity (-1 = negative, 1 = positive)')
plt.ylabel('Frequency')
plt.grid(True)
plt.show()
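
Behind the picture, a histogram is just a count of how many values fall into each range (bin). The sketch below shows that counting step by hand. The `bin_counts` helper, the four equal-width bins, and the sample polarity values are assumptions; seaborn chooses its own bins automatically, so its bars will not necessarily match these. The helper also assumes every value lies between `low` and `high`.

```python
def bin_counts(values, low, high, n_bins):
    """Count how many values fall into each of n_bins equal-width bins on [low, high]."""
    width = (high - low) / n_bins
    counts = [0] * n_bins
    for v in values:
        # A value exactly equal to `high` goes into the last bin.
        index = min(int((v - low) / width), n_bins - 1)
        counts[index] += 1
    return counts

# Made-up polarity scores between -1 and 1.
toy_polarities = [-0.5, 0.0, 0.0, 0.2, 0.65, 1.0]

print(bin_counts(toy_polarities, -1, 1, 4))  # → [0, 1, 3, 2]
```

Each number in the result is the height of one bar, reading from the most negative bin to the most positive one.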

## Two other methods for data processing and visualization

### Dimensionality Reduction (From Alex)
- Dimensionality reduction is a method for representing a dataset using a smaller number of features (i.e. dimensions) while still capturing the original data's meaningful properties.
- We do this to remove irrelevant, redundant, or noisy features, which gives us a model with fewer variables.
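
One common way to do this is principal component analysis (PCA). The summary above does not say which technique was used, so PCA here is an assumption, and the data is made up. In this sketch the third feature is just the sum of the first two, so it is redundant and the data can be squeezed from 3 dimensions down to 2 without losing information.

```python
import numpy as np

# Toy data: 6 points with 3 features, where feature 3 = feature 1 + feature 2.
X = np.array([
    [1.0, 2.0, 3.0],
    [2.0, 1.0, 3.0],
    [3.0, 4.0, 7.0],
    [4.0, 3.0, 7.0],
    [5.0, 6.0, 11.0],
    [6.0, 5.0, 11.0],
])

X_centered = X - X.mean(axis=0)  # PCA works on mean-centered data

# The rows of Vt are the principal directions, strongest first.
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

X_reduced = X_centered @ Vt[:2].T       # keep only the 2 strongest directions
explained = (S ** 2) / np.sum(S ** 2)   # share of the variance along each direction

print(X_reduced.shape)
print(explained.round(3))
```

The first two directions should account for essentially all of the variance, which is the sign that the third dimension was redundant.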

### Vector Embeddings (From Blu)
- Vector embedding is a technique used by language models that maps words to vectors of real numbers.
- Each word becomes a point in a vector space with many dimensions, and words with similar meanings tend to land close together.

In an example from Blu, word vectors are used to find outliers, such as a made-up word that the model has no vector for:

In [None]:
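import spacy

nlp = spacy.load("en_core_web_lg")  # large English model, which includes word vectors
tokens = nlp("dog cat banana afskfsd")

# For each word, print its text, whether it has a vector, the vector's length (norm),
# and whether it is out of the model's vocabulary (an outlier).
for token in tokens:
    print(token.text, token.has_vector, token.vector_norm, token.is_oov)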
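
Once words are vectors, we can measure how alike two words are by comparing the directions of their vectors, usually with cosine similarity. The sketch below uses made-up 3-dimensional vectors (the `toy_vectors` values are assumptions, not real spaCy output) so it runs without downloading a model.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means they point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up 3-dimensional "embeddings"; real spaCy vectors have hundreds of dimensions.
toy_vectors = {
    "dog": [0.9, 0.8, 0.1],
    "cat": [0.8, 0.9, 0.2],
    "banana": [0.1, 0.2, 0.9],
}

dog_cat = cosine_similarity(toy_vectors["dog"], toy_vectors["cat"])
dog_banana = cosine_similarity(toy_vectors["dog"], toy_vectors["banana"])

print("dog vs cat:", round(dog_cat, 3))
print("dog vs banana:", round(dog_banana, 3))
```

In this toy setup "dog" and "cat" score much closer to 1 than "dog" and "banana", which is the kind of pattern that lets a program tell related words from unrelated ones.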

## Summary
Some of what we learned goes very deep, and fully mastering it would take college-level study. Even so, Arlington2050 taught our class how to take words and sentences and represent them mathematically. It also taught us a little about AI, since AI uses vectors to understand words and phrases. In my opinion, this was a great introduction to language processing and data visualization, and it taught these skills through a real-world application.