<a href="https://colab.research.google.com/github/miyeonKim787/EV_Adoption/blob/main/Text_Analysis_EV.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Introduction**

The automotive landscape is undergoing a revolutionary transformation as society pivots towards sustainable and eco-friendly transportation solutions. In August 2022, Congress approved a sweeping reform of the electric vehicle ("EV") tax credits as part of the $430 billion Inflation Reduction Act (IRA).

Consumers currently can take advantage of the $7,500 new EV credit or $4,000 used EV credit when they file their tax returns the following year, but starting in January 2024, consumers can transfer the credits to a car dealer, effectively lowering the vehicle’s purchase price.

Despite the benefits EVs offer, from government incentives to a cleaner form of mobility, their widespread adoption has met with various challenges, leading to a slower-than-expected transition away from traditional internal combustion engine (ICE) vehicles. For example, as reported by The Economist, "a poll published in July by the Pew Research Centre found that less than two-fifths of them would consider buying an electric vehicle."

This project attempts to understand the roadblocks to EV adoption from the perspective of potential customers. It collects the comments on a YouTube video through the YouTube Data API, then analyses them with TF-IDF and topic modelling.

In [37]:
## Step 1 - Data Collection
import pandas as pd
from googleapiclient.discovery import build

# YouTube API credentials (placeholder: supply your own key and keep it out of version control)
api_key = 'YOUR_API_KEY'

# Construct YouTube client
youtube = build('youtube', 'v3', developerKey=api_key)

# Get video ID from URL
url = "https://www.youtube.com/watch?v=cZlsZwcIgpc"
video_id = url.split('=')[1]

# Initialize empty list to collect comment records
comments = []

# Build initial API request object
request = youtube.commentThreads().list(
    part='snippet',
    videoId=video_id,
    maxResults=100
)

# Iterate through API response to retrieve comments
while request:

    response = request.execute()

    for item in response['items']:

        # Extract the top-level comment's fields
        snippet = item['snippet']['topLevelComment']['snippet']

        # Append comment record to list
        comments.append({'date': snippet['publishedAt'],
                         'author': snippet['authorDisplayName'],
                         'comment': snippet['textDisplay']})

    # Get next page token
    request = youtube.commentThreads().list_next(request, response)

# Convert final list of comments to a DataFrame
df = pd.DataFrame(comments)

print(df)

                       date                 author  \
0      2023-12-17T13:47:04Z             @SickPrid3   
1      2023-12-17T08:50:58Z  @racheljustrachel2732   
2      2023-12-17T07:01:08Z        @Scorcher-ii1ty   
3      2023-12-17T06:57:02Z        @Scorcher-ii1ty   
4      2023-12-17T01:19:07Z          @AbronHawkins   
...                     ...                    ...   
13599  2023-10-16T16:28:51Z          @slash2freeze   
13600  2023-10-16T16:28:48Z         @kingayman5225   
13601  2023-10-16T16:28:42Z               @JogBird   
13602  2023-10-16T16:27:46Z           @kineticstar   
13603  2023-10-16T16:27:14Z              @Waltaere   

                                                 comment  
0      If technology does not sell itself, it;s doome...  
1      People should have a choice. With prices have ...  
2      Once Trump gets in it’s bye bye EV mandates. Y...  
3      I love how Ford calls an EV a Mustang. Where i...  
4      Let’s say that working class people can’t affo...
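The `url.split('=')[1]` step in the cell above works for this particular URL, but it would pick up extra text if the link carried further query parameters such as a timestamp. A minimal sketch of a more tolerant parser, using only the standard library, might look like the following. The helper name `extract_video_id` is hypothetical and not part of the original notebook.

```python
from urllib.parse import urlparse, parse_qs

def extract_video_id(url: str) -> str:
    """Return the YouTube video ID from a watch or youtu.be URL."""
    parsed = urlparse(url)
    # Short links such as https://youtu.be/<id> carry the ID in the path
    if parsed.hostname == 'youtu.be':
        return parsed.path.lstrip('/')
    # Standard watch URLs carry the ID in the "v" query parameter
    query = parse_qs(parsed.query)
    if 'v' not in query:
        raise ValueError(f'No video ID found in {url!r}')
    return query['v'][0]

# Same video as above, with an extra (hypothetical) timestamp parameter
print(extract_video_id('https://www.youtube.com/watch?v=cZlsZwcIgpc&t=30s'))
```

With the extra `&t=30s` parameter, the plain `split('=')` approach would return `cZlsZwcIgpc&t`, whereas this sketch returns only the ID.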

In [38]:
## Step 2 - Data Cleaning
import re

import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')
nltk.download('wordnet')

# Lowercase
df['comment'] = df['comment'].apply(lambda x: x.lower())

# Remove punctuation (this also strips apostrophes, so contractions such as
# "don't" become "dont" and slip past the stopword filter below)
df['comment'] = df['comment'].apply(lambda x: re.sub(r'[^\w\s]','',x))

# Remove stopwords
stop_words = set(stopwords.words('english'))
df['comment'] = df['comment'].apply(lambda x: ' '.join([word for word in x.split() if word not in stop_words]))

# Lemmatization
from textblob import Word

df['comment'] = df['comment'].apply(lambda x: ' '.join([Word(word).lemmatize() for word in x.split()]))

# Store cleaned comments
cleaned_comments = df['comment']

# Print for checking
print(cleaned_comments.head(5))

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


0    technology sell doomed failbrevs scam quotgree...
1    people choice price gone salary dont go manufa...
2    trump get bye bye ev mandate wait 6 hour piece...
3    love ford call ev mustang coyote motor garbage...
4    let say working class people cant afford espec...
Name: comment, dtype: object
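Tokens such as `failbrevs` and `quotgreenerquot` in the output above suggest that the `textDisplay` field carried HTML markup (`<br>` tags and `&quot;` entities), which the punctuation step then fused into neighbouring words. One possible remedy, sketched here under that assumption, is to strip tags and decode entities before lowercasing. The helper name `strip_html` and the sample string are hypothetical.

```python
import html
import re

def strip_html(text: str) -> str:
    """Turn HTML-formatted comment text into plain text."""
    # Replace tags such as <br> or <a href=...> with a space so words stay separated
    text = re.sub(r'<[^>]+>', ' ', text)
    # Decode entities such as &quot; and &#39; into the characters they stand for
    text = html.unescape(text)
    # Collapse runs of whitespace left behind by the removed tags
    return re.sub(r'\s+', ' ', text).strip()

# Made-up sample, not a comment from the dataset
raw = 'Doomed to fail.<br>EVs are not &quot;greener&quot;'
print(strip_html(raw))
```

In the notebook this could be applied as `df['comment'].apply(strip_html)` ahead of the lowercasing step. Requesting plain text from the API through the `textFormat='plainText'` parameter of `commentThreads().list` may be another way to avoid the markup altogether.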


In [98]:
## Step 3 - Topic Modelling
import gensim
from gensim import corpora
from gensim.utils import simple_preprocess

# Keep the cleaned comments in their own column
df['cleaned_comments'] = cleaned_comments

# Tokenize each comment
df['tokens'] = df['cleaned_comments'].apply(simple_preprocess)

# Create dictionary from tokens, then a bag-of-words corpus
dictionary = corpora.Dictionary(df['tokens'])
corpus = [dictionary.doc2bow(text) for text in df['tokens']]

# Fit an LDA model with five topics
num_topics = 5
ldamodel = gensim.models.LdaModel(corpus, num_topics=num_topics, id2word=dictionary)

# Print the top terms of each topic
for topic_id, topic in ldamodel.print_topics(-1):
   print(f'Topic {topic_id}: {topic}')



Topic 0: 0.036*"ev" + 0.015*"charging" + 0.014*"charge" + 0.013*"gas" + 0.012*"battery" + 0.009*"car" + 0.009*"time" + 0.009*"cost" + 0.009*"tesla" + 0.008*"year"
Topic 1: 0.037*"tesla" + 0.033*"ev" + 0.027*"car" + 0.014*"price" + 0.013*"dealer" + 0.012*"one" + 0.012*"dealership" + 0.009*"sale" + 0.009*"model" + 0.009*"buy"
Topic 2: 0.017*"car" + 0.017*"ev" + 0.014*"battery" + 0.010*"vehicle" + 0.009*"electric" + 0.009*"power" + 0.007*"time" + 0.007*"fire" + 0.007*"go" + 0.006*"fuel"
Topic 3: 0.036*"ev" + 0.019*"car" + 0.018*"charging" + 0.013*"vehicle" + 0.013*"battery" + 0.012*"people" + 0.009*"cost" + 0.009*"range" + 0.009*"infrastructure" + 0.008*"electric"
Topic 4: 0.048*"ev" + 0.016*"want" + 0.016*"people" + 0.012*"car" + 0.011*"government" + 0.011*"don" + 0.010*"get" + 0.007*"buy" + 0.007*"ice" + 0.006*"going"
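The LDA model above is fitted on a bag-of-words corpus, in which each comment is reduced to pairs of (token ID, count). The plain-Python sketch below illustrates that representation without gensim. It is a stand-in for illustration only: the function names and sample comments are hypothetical, and the ID numbering here (order of first appearance) is an assumption that need not match the IDs gensim's `Dictionary` assigns.

```python
from collections import Counter

def build_vocabulary(docs):
    """Assign an integer ID to every distinct token, in order of first appearance."""
    token2id = {}
    for doc in docs:
        for token in doc:
            # len(token2id) is the next free ID; existing tokens keep theirs
            token2id.setdefault(token, len(token2id))
    return token2id

def to_bow(doc, token2id):
    """Convert one tokenized document into sorted (token_id, count) pairs."""
    counts = Counter(token2id[token] for token in doc if token in token2id)
    return sorted(counts.items())

# Hypothetical tokenized comments, not taken from the dataset
docs = [['ev', 'charging', 'cost', 'ev'],
        ['tesla', 'price', 'ev']]

token2id = build_vocabulary(docs)
bow_corpus = [to_bow(doc, token2id) for doc in docs]
print(token2id)
print(bow_corpus)
```

Word order is discarded in this representation, which is why the topics above are reported as weighted lists of individual terms rather than phrases.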


In [113]:
## Step 4 - TF-IDF Analysis
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Dataframe from extracted and cleaned comments
df = pd.DataFrame({'cleaned_comments': cleaned_comments})

# Create vectorizer
vectorizer = TfidfVectorizer()

# Generate vectors
tfidf_vectors = vectorizer.fit_transform(df['cleaned_comments'])

# Get feature names (terms/tokens)
feature_names = vectorizer.get_feature_names_out()

# Print the sparse TF-IDF vector for the first comment
print(tfidf_vectors[0])

# Print matrix shape: (number of comments, vocabulary size)
print(tfidf_vectors.shape)

# Print the first comment's TF-IDF weight for the first vocabulary term
print(tfidf_vectors[:,0].toarray()[0])

# Print the terms that appear in the first comment
first_vector = tfidf_vectors[0]

for idx in first_vector.indices:
   print(feature_names[idx])

  (0, 4593)	0.12582249574342833
  (0, 20731)	0.18535554527984333
  (0, 18198)	0.23516849611041138
  (0, 19450)	0.25534587466924025
  (0, 10674)	0.2447379595752156
  (0, 1646)	0.2455273040899731
  (0, 12871)	0.27259947576643784
  (0, 21280)	0.3097283705460124
  (0, 13797)	0.21852458012636952
  (0, 5511)	0.25816575426747207
  (0, 10492)	0.12653280097592276
  (0, 15744)	0.3457278498973869
  (0, 17223)	0.18223947205491653
  (0, 8129)	0.36060967845441944
  (0, 6712)	0.27259947576643784
  (0, 17396)	0.1552663975842386
  (0, 19058)	0.16514927191895404
(13604, 21531)
[0.]
charge
waiting
spent
time
important
accident
minor
written
often
cost
ice
quotgreenerquot
scam
failbrevs
doomed
sell
technology
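To make the weights above easier to interpret, the sketch below computes TF-IDF by hand on a tiny made-up corpus. It assumes the formula that, as far as I understand, `TfidfVectorizer` applies by default: raw term counts, a smoothed inverse document frequency of ln((1 + n) / (1 + df)) + 1, and L2 normalization of each document vector. The function name and sample comments are hypothetical.

```python
import math
from collections import Counter

def tfidf(docs):
    """Compute smoothed, L2-normalized TF-IDF weights for tokenized documents."""
    n_docs = len(docs)
    # Document frequency: number of documents containing each term
    doc_freq = Counter(term for doc in docs for term in set(doc))
    # Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1
           for term, count in doc_freq.items()}
    vectors = []
    for doc in docs:
        # Raw term counts weighted by idf
        weights = {term: tf * idf[term] for term, tf in Counter(doc).items()}
        # Scale each document vector to unit Euclidean length
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()} if norm else {})
    return vectors

# Hypothetical tokenized comments, not taken from the dataset
docs = [['ev', 'charging', 'cost'],
        ['ev', 'tesla', 'price'],
        ['ev', 'charging', 'range']]

# Terms of the first document, from highest to lowest weight
for term, weight in sorted(tfidf(docs)[0].items(), key=lambda kv: -kv[1]):
    print(f'{term}: {weight:.3f}')
```

Terms that occur in fewer documents receive larger weights, so `cost` outranks `ev` in the first sample document. The same effect seems visible in the notebook output: the largest weights for the first comment appear to belong to the rare tokens `failbrevs` (0.361) and `quotgreenerquot` (0.346).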
