# Introduction and Takeaways
This is a quick-and-dirty topic modeling analysis of the alt text in my corpus, meant to check for trends in the kinds of images that are common on Twitter. The topics identified fell into these themes:
- images of animals (cats, dogs)
- images of humans, especially selfies
- images relating to current events (in December 2020, these included Christmas and Covid)
- images of text, including a lot of Spotify Wrapped screencaps

An important limitation of this approach is that it can only tell us what kinds of images exist *among those that already have alt text*. That subset is a small portion of all Twitter images and is likely a biased sample.

# Import Packages and Data

In [7]:
# basics
import pandas as pd 
import numpy as np

# files
import glob
import pickle
import os
import requests
import sys

#images
from PIL import Image

# nlp
import nltk
nltk.download('averaged_perceptron_tagger')
from nltk.tag import pos_tag

sys.path.insert(0, '..')
from SpacyPreprocessor import SpacyPreprocessor
from TopicModelExploration import show_topics, show_top_docs

from sklearn.decomposition import NMF, TruncatedSVD, LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/labbot/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


In [27]:
%load_ext autoreload
%autoreload 2

In [6]:
df = pd.read_pickle('twitter_alt_text.pkl')

# Text Preprocessing

## lemmatize and remove numbers, symbols, and selected POS tags with spaCy

In [22]:
spacy_model = SpacyPreprocessor.load_model()

In [23]:
preprocessor = SpacyPreprocessor(spacy_model=spacy_model, lemmatize=True, remove_numbers=True, 
                                 remove_stopwords=False, remove_special=True, 
                                 pos_to_remove=['ADP','SYM','NUM','AUX'])
df['spacy_pipe'] = preprocessor.preprocess_text_list(list(df['alt_text']))

4881it [00:04, 983.23it/s]


In [24]:
# check how that worked
df.sample(3)

Unnamed: 0,id,created_at,tweet_text,tweet_url,img_url,alt_text,media_type,spacy_pipe
1140807,1334658662565109761,2020-12-04 00:39:56,Jack is perfect. https://t.co/mePMDha4ot,https://t.co/mePMDha4ot,http://pbs.twimg.com/media/EoWo-ihXYAgvPxy.jpg,Jack the cat sleeping peacefully,photo,jack the cat sleep peacefully
1213729,1335039542785662978,2020-12-05 01:53:25,"Twitter, would it be legal for me to drive my ...",https://t.co/7Xj8xueKz7,http://pbs.twimg.com/media/EocDYsaXMAA6Ckm.jpg,Giant red bow on top of shiny black car with a...,photo,giant red bow top shiny black car a man a flan...
486804,1334534352168894464,2020-12-03 16:25:58,@visakanv Black cardigan and a black scarf and...,https://t.co/AYLx8PryNA,http://pbs.twimg.com/media/EoU36kPWEAwLEft.jpg,Pink bag pack on green khaki jacket,photo,pink bag pack green khaki jacket


## check for and remove super short documents

In [25]:
df['spacy_pipe_len'] = df['spacy_pipe'].str.split(" ").str.len()
df['spacy_pipe_len'].describe()

count    4881.000000
mean       17.015775
std        22.858222
min         1.000000
25%         5.000000
50%        10.000000
75%        20.000000
max       291.000000
Name: spacy_pipe_len, dtype: float64

In [26]:
df = df[df['spacy_pipe_len'] >= 5]

# Fit Vectorizer and Topic Model
I did minimal iteration with different pipelines here. A future step would be to set up a more robust pipeline and iterate through different vectorizers, models, and hyperparameters.
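
One way such an iteration could be organized is sketched below. This is only an illustration, not the pipeline used in this notebook: the helper name `fit_topic_models`, the toy corpus, and the parameter values are all assumptions. The loop fits every vectorizer/model pair so the resulting topics can be compared side by side.

```python
from itertools import product

from sklearn.decomposition import NMF, LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer


def fit_topic_models(corpus, vectorizers, models):
    """Fit every vectorizer/model combination and collect the fitted objects.

    Hypothetical helper: `vectorizers` and `models` map a label to a
    zero-argument factory, so each combination gets a fresh instance.
    """
    results = {}
    for (vec_name, make_vec), (model_name, make_model) in product(
        vectorizers.items(), models.items()
    ):
        vectorizer = make_vec()
        doc_word = vectorizer.fit_transform(corpus)
        model = make_model()
        doc_topic = model.fit_transform(doc_word)
        results[(vec_name, model_name)] = {
            "vectorizer": vectorizer,
            "model": model,
            "doc_topic": doc_topic,
        }
    return results


# Toy stand-in for df['spacy_pipe']; these documents are made up for illustration.
toy_corpus = [
    "black cat sleep sofa",
    "cat sit window",
    "dog run park",
    "dog play ball park",
    "screenshot tweet text read",
    "screenshot text message read",
]

vectorizers = {
    "count": lambda: CountVectorizer(min_df=1, ngram_range=(1, 2)),
    "tfidf": lambda: TfidfVectorizer(min_df=1, ngram_range=(1, 2)),
}
# LDA is usually paired with raw counts; the tf-idf pairing is fitted here
# only because the loop covers every combination.
models = {
    "nmf": lambda: NMF(n_components=3, random_state=7, max_iter=500),
    "lda": lambda: LatentDirichletAllocation(n_components=3, random_state=7),
}

results = fit_topic_models(toy_corpus, vectorizers, models)
for (vec_name, model_name), fitted in results.items():
    print(vec_name, model_name, fitted["doc_topic"].shape)
```

On the real data, `toy_corpus` would be replaced by `df['spacy_pipe']`, with `min_df` and `n_components` raised to values like the ones used in the cells that follow.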

In [33]:
# define corpus as the spacy-processed version of the data
corpus = df['spacy_pipe']

In [34]:
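# define stopwords: copy spaCy's defaults so the shared set is not modified in place
stop_words = set(spacy_model.Defaults.stop_words)

# add custom stopwords to stopwords list
custom_stopwords = ["pron"]
stop_words.update(custom_stopwords)

# scikit-learn documents stop_words as a list, so convert the set
stop_words = sorted(stop_words)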

In [32]:
# define parameters for the vectorizer
params = {
    'stop_words':stop_words,
    'min_df':10,
    'ngram_range':(1, 2)
}

In [35]:
# fit vectorizer
vectorizer = CountVectorizer(**params)
doc_word_matrix = vectorizer.fit_transform(corpus)
doc_word_matrix.shape



(3716, 743)

In [36]:
# create and fit decomposition model
nmf = NMF(n_components=12,random_state=7)

# create the document-topic matrix
doc_topic_matrix = nmf.fit_transform(doc_word_matrix)

# create columns names
topicnames = ['Topic_' + str(i) for i in range(nmf.n_components)]

# index names
docnames = ['AltText_' + str(i) for i in range(len(corpus))]

# create a dataframe
df_doc_topic = pd.DataFrame(np.round(doc_topic_matrix,4), columns=topicnames, index=docnames)

df_doc_topic.head()

Unnamed: 0,Topic_0,Topic_1,Topic_2,Topic_3,Topic_4,Topic_5,Topic_6,Topic_7,Topic_8,Topic_9,Topic_10,Topic_11
AltText_0,0.0077,0.005,0.0098,0.0413,0.0048,0.0056,0.0066,0.0,0.0267,0.0094,0.0018,0.0091
AltText_1,0.0,0.2578,0.0,0.0155,0.0054,0.0,0.0,0.0,0.0143,0.0044,0.0,0.0143
AltText_2,0.0002,0.001,0.0019,0.005,0.0007,0.0,0.0,0.0,0.0003,0.0005,0.0,0.0022
AltText_3,0.0,0.0018,0.0026,0.0,0.0,0.032,0.0,0.0,0.0,0.0033,0.0,0.0009
AltText_4,0.0027,0.0044,0.0005,0.0006,0.0009,0.0,0.0,0.0053,0.0072,0.0071,0.0125,0.0173


# Explore and Interpret Topics
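`show_topics` comes from a local module that is not shown in this notebook. As a rough idea of what a helper like it might do (an assumption, not the actual implementation), the sketch below pulls the highest-weighted terms for each topic out of a topic-term matrix. The function name `top_words_per_topic` and the toy weights are hypothetical.

```python
import numpy as np


def top_words_per_topic(components, feature_names, n_words):
    """Return the n_words highest-weighted terms for each topic row."""
    feature_names = np.asarray(feature_names)
    # argsort is ascending, so reverse each row and keep the first n_words columns
    top_idx = np.argsort(components, axis=1)[:, ::-1][:, :n_words]
    return [feature_names[row] for row in top_idx]


# Made-up topic-term weights: 2 topics over a 4-word vocabulary.
toy_components = np.array([
    [0.9, 0.1, 0.5, 0.0],
    [0.0, 0.8, 0.2, 0.7],
])
toy_vocab = ["cat", "text", "dog", "screenshot"]

toy_topics = top_words_per_topic(toy_components, toy_vocab, 2)
for i, words in enumerate(toy_topics):
    print(f"Topic_{i}:", ", ".join(words))
```

With the fitted objects from this notebook, the equivalent call would be along the lines of `top_words_per_topic(nmf.components_, vectorizer.get_feature_names_out(), 15)`.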

In [39]:
topic_keywords = show_topics(vectorizer, nmf, 15)
topic_keywords

[array(['blue', 'red', 'wear', 'background', 'green', 'pink', 'dark',
        'hair', 'light', 'yellow', 'purple', 'brown', 'eye', 'tree',
        'grey'], dtype='<U18'),
 array(['text', 'image', 'read', 'image text', 'automatic',
        'automatic image', 'text read', 'background', 'right', 'view',
        'pay', 'page', 'book', 'set', 'item'], dtype='<U18'),
 array(['com', 'https', 'www', 'https www', 'instagram', 'twitter',
        'follow', 'video', 'group', 'study', 'twitter com', 'office',
        'join', 'comment', 'shot'], dtype='<U18'),
 array(['like', 'love', 'know', 'time', 'thing', 'number', 'people',
        'good', 'feel', 'care', 'word', 'day', 'want', 'look like',
        'small'], dtype='<U18'),
 array(['black', 'wear', 'wear black', 'hair', 'pink', 'shirt',
        'black white', 'silver', 'tie', 'man', 'people', 'jacket',
        'purple', 'dress', 'french'], dtype='<U18'),
 array(['screenshot', 'tweet', 'read', 'spotify', 'song', 'picture',
        'wrap', 'spotify

In [41]:
# add the original alt text back into the doc-topic dataframe
df_doc_topic['orig_comments'] = df['alt_text'].values
df_doc_topic.sample(2)

Unnamed: 0,Topic_0,Topic_1,Topic_2,Topic_3,Topic_4,Topic_5,Topic_6,Topic_7,Topic_8,Topic_9,Topic_10,Topic_11,orig_comments
AltText_1917,0.0,0.1016,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Page image from Opere del cardinale Pietro Bembo
AltText_23,0.0097,0.0,0.004,0.0019,0.0012,0.0008,0.0026,0.0,0.0157,0.0,0.0119,0.0,Guildy's head and tail stick out from behind t...


In [42]:
topic_docs = show_top_docs(df_doc_topic, 'Topic_11',7)
topic_docs

array(['using System;\nusing System.Collections.Generic;\nusing System.ComponentModel;\nusing System.Data;\nusing System.Drawing;\nusing System.Linq;\nusing System.Text;\nusing System.Threading.Tasks;\nusing System.Windows.Forms;\n\nnamespace WindowsFormsApplication6\n{\n    public partial class Form1 : Form\n    {\n        private MyFont _font;\n        public Form1()\n        {\n            InitializeComponent();\n        }\n\n        priv...',
       'Your next steps depend on the total of the Used column from the df -h command above.      If you’re using less space than your intended plan requires, you can move onto the next step without any further action.     If you’re using more space than your intended plan allows, you need to remove some files to free up some space before moving onto the next step. See the options for doing this in the Download Files from Your Linode guide.  Before resizing your Linode to a new plan, you need to resize the disk to match the storage volume of t