# Project 2: Using MLTA to Show the Perception of AI Innovation Across Various Industries
## Econ 1680: MLTA and Econ

#### Name: Isha Ponugoti

## I. Importing and Cleaning Data

### TechCrunch AI Articles
This source is publicly available at https://techcrunch.com/category/artificial-intelligence/. I start by using Selenium to click the "Load More" button repeatedly so that all articles available in TechCrunch's AI category are loaded. I then use Beautiful Soup to parse the expanded HTML and build a dataframe with four fields: article title, article summary text, publication date, and article href.

In [33]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import os
import csv

import random
import statsmodels.api as sm
from statsmodels.iolib.summary2 import summary_col
from math import sqrt

import bs4
import requests

from datetime import datetime, timezone

from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import Lasso, LassoCV
from sklearn.linear_model import Ridge, RidgeCV
from sklearn.preprocessing import normalize
import scipy.cluster.hierarchy as shc
from sklearn.cluster import AgglomerativeClustering
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
import time

In [35]:
# Parsing url without pressing load more:
# url = "https://techcrunch.com/category/artificial-intelligence/"
# response = requests.get(url)

# soup = bs4.BeautifulSoup(response.text, "html.parser")
# article_titles, article_contents, article_hrefs, article_dates = [], [], [], []

# for tag in soup.findAll("div", {"class": "post-block post-block--image post-block--unread"}):
#     tag_header = tag.find("a", {"class": "post-block__title__link"})
#     tag_content = tag.find("div", {"class": "post-block__content"})
#     tag_date = tag.find("time", {"class": "river-byline__time"})

#     article_title = tag_header.get_text().strip()
#     article_href = tag_header["href"]
#     article_content = tag_content.get_text().strip()
#     article_date = tag_date["datetime"].strip()
#     article_date=datetime.strptime(article_date, '%Y-%m-%dT%H:%M:%S%z').astimezone()
#     article_date=article_date.strftime('%I:%M %p %Z %B %d, %Y')

#     article_titles.append(article_title)
#     article_contents.append(article_content)
#     article_hrefs.append(article_href)
#     article_dates.append(article_date)


In [101]:
# Path to your WebDriver executable
driver_path = '/Users/ishaponugoti/Downloads/chromedriver-mac-arm64/chromedriver'

# Create a Service object with the path to the chromedriver
service = Service(executable_path=driver_path)

# Initialize the WebDriver
driver = webdriver.Chrome(service=service)

# URL of the page you want to scrape
url = "https://techcrunch.com/category/artificial-intelligence/"

# Open the page
driver.get(url)

# Initialize lists to store parsed info
article_titles, article_contents, article_hrefs, article_dates = [], [], [], []

# Click "Load More" up to 100 times
for i in range(100):
    # Pause so the page (or the previously requested batch) can finish loading
    time.sleep(5)

    # Find the "Load More" button and click it (to load ~17 more articles)
    try:
        load_more_button = driver.find_element(By.CSS_SELECTOR, 'button.load-more')
        load_more_button.click()
    except Exception as e:
        print(f"Error finding or clicking 'Load More' button: {e}")
        break

In [102]:
# Parse page source with BeautifulSoup
html_source = driver.page_source
soup = bs4.BeautifulSoup(html_source, "html.parser")

# Extract the title, summary, link, and date from each article block
for tag in soup.findAll("article", {"class": "post-block post-block--image post-block--unread"}):
    tag_header = tag.find("a", {"class": "post-block__title__link"})
    tag_content = tag.find("div", {"class": "post-block__content"})
    tag_date = tag.find("time", {"class": "river-byline__full-date-time"})

    article_title = tag_header.get_text().strip()
    article_href = tag_header["href"]
    article_content = tag_content.get_text().strip()
    article_date = tag_date["datetime"].strip()
    article_date = datetime.strptime(article_date, '%Y-%m-%dT%H:%M:%S').astimezone()
    article_date = article_date.strftime('%I:%M %p %Z %B %d, %Y')

    article_titles.append(article_title)
    article_contents.append(article_content)
    article_hrefs.append(article_href)
    article_dates.append(article_date)
        
# Close the WebDriver
driver.quit()

# print(len(article_titles))
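
The timestamp handling in the cell above can also be checked in isolation. The sketch below is an alternative under stated assumptions, not the notebook's method: it assumes the `datetime` attribute is an ISO 8601 string that may or may not carry a UTC offset, it treats offset-free values as UTC, and the sample strings are made up. `parse_article_date` is a hypothetical helper name.

```python
from datetime import datetime, timezone

def parse_article_date(raw: str) -> datetime:
    """Parse an ISO 8601 timestamp into a timezone-aware UTC datetime.

    Assumption: a timestamp without an offset is treated as UTC.
    """
    parsed = datetime.fromisoformat(raw.strip())
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc)

# Illustrative inputs (not taken from the scraped page)
with_offset = parse_article_date("2024-04-06T11:05:00-07:00")
without_offset = parse_article_date("2024-04-06T18:05:00")
```

Keeping `datetime` objects (or ISO strings) instead of strings formatted with `%Z` may make later grouping by month easier, since `strptime` only recognizes `UTC`, `GMT`, and the local time zone names for `%Z`.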

In [110]:
print("Total articles: " + str(len(article_titles)))
print("Total unique articles: " + str(len(set(article_titles))))

Total articles: 1005
Total unique articles: 1005


Through the code above, I have successfully parsed 1005 articles. Next, I create a pandas DataFrame from the data collected in the lists article_titles, article_contents, article_hrefs, and article_dates.

In [133]:
for a in article_titles:
    if "Robotics" in a:
        print(a)

Agility Robotics lays off some staff amid commercialization focus
Neura Robotics picks up $55M to ramp up in cognitive robotics
Rice Robotics picks up $7M, powers SoftBank’s office delivery
AMP Robotics attracts investment from Microsoft’s Climate Innovation Fund
Robotics safety firm Veo raises $29 million, with help from Amazon


In [114]:
data = {
    'Title': article_titles,
    'Content': article_contents,
    'URL': article_hrefs,
    'Date': article_dates
}

df = pd.DataFrame(data)

print(df.head())

                                               Title  \
0  What we’ve learned from the women behind the A...   
1  Sundar Pichai on the challenge of innovating i...   
2  Meta’s new AI deepfake playbook: More labels, ...   
3  TechCrunch Minute: YC Demo Day’s biggest showc...   
4  Rubrik’s IPO filing reveals an AI governance c...   

                                             Content  \
0  The AI boom, love it or find it to be a bit mo...   
1  It was a notable appearance because Pichai’s b...   
2  Meta has announced changes to its rules on AI-...   
3  Well-known startup accelerator Y Combinator he...   
4  Rubrik, the data management company that filed...   

                                                 URL  \
0  /2024/04/06/what-weve-learned-from-the-women-b...   
1  /2024/04/05/sundar-pichai-on-the-challenge-of-...   
2                  /2024/04/05/meta-deepfake-labels/   
3  /2024/04/05/techcrunch-minute-ycs-demo-day-hig...   
4        /2024/04/05/rubrik-ai-governance-comm

I then visit each article URL and extract TechCrunch's keyword tags, which I later use to help assign articles to the following industries: healthcare, education, finance, media/entertainment, logistics and transportation, technology, and other.

In [198]:
def extract_tags_from_url(URL):
    url = "https://techcrunch.com" + URL
    
    response = requests.get(url)
    soup = bs4.BeautifulSoup(response.text, 'html.parser')
    
    # Find the <meta> tag with name="sailthru.tags"
    meta_tag = soup.find('meta', attrs={'name': 'sailthru.tags'})
    
    if meta_tag and meta_tag.has_attr('content'):
        # Split the content attribute by ", " to get the list of tags
        tags = meta_tag['content'].split(', ')
        return tags
    else:
        print("No sailthru.tags meta tag found.")
        return []


In [201]:
# Apply the function to df
df['tags'] = df['URL'].apply(extract_tags_from_url)

No sailthru.tags meta tag found.
No sailthru.tags meta tag found.
No sailthru.tags meta tag found.
No sailthru.tags meta tag found.
No sailthru.tags meta tag found.
No sailthru.tags meta tag found.
No sailthru.tags meta tag found.
No sailthru.tags meta tag found.


In [202]:
df.head()

|   | Title | Content | URL | Date | tags |
|---|-------|---------|-----|------|------|
| 0 | What we’ve learned from the women behind the A... | The AI boom, love it or find it to be a bit mo... | /2024/04/06/what-weve-learned-from-the-women-b... | 02:05 PM EDT April 06, 2024 | [AI policy, AI startups, Equity podcast, Women... |
| 1 | Sundar Pichai on the challenge of innovating i... | It was a notable appearance because Pichai’s b... | /2024/04/05/sundar-pichai-on-the-challenge-of-... | 05:59 PM EDT April 05, 2024 | [Alphabet, artificial intelligence, Google, go... |
| 2 | Meta’s new AI deepfake playbook: More labels, ... | Meta has announced changes to its rules on AI-... | /2024/04/05/meta-deepfake-labels/ | 05:29 PM EDT April 05, 2024 | [meta deepfake and manipulated media policy] |
| 3 | TechCrunch Minute: YC Demo Day’s biggest showc... | Well-known startup accelerator Y Combinator he... | /2024/04/05/techcrunch-minute-ycs-demo-day-hig... | 04:00 PM EDT April 05, 2024 | [AI, Startups, the techcrunch minute, Y Combin... |
| 4 | Rubrik’s IPO filing reveals an AI governance c... | Rubrik, the data management company that filed... | /2024/04/05/rubrik-ai-governance-committee/ | 01:41 PM EDT April 05, 2024 | [AI governance, EU AI Act, onetrust, Rubrik] |


In [208]:
tags_list = df["tags"].tolist()
flat = [item for sublist in tags_list for item in sublist]
set(flat)

{'industrial robot',
 "Shaquille O'Neal",
 'EC Enterprise',
 'AWS',
 'Talkdesk',
 'Zoom Ventures',
 'WordPress',
 'artificial intelligence',
 'Brendan Iribe',
 'Java',
 'consortium',
 'Samsung Electronics',
 'universe',
 'Argonautic Ventures',
 'messages',
 'eu',
 'Intrinsic',
 'algorithmic transparency',
 'The TechCrunch Podcast',
 'Dropbox',
 'mosaicml',
 'iOS apps',
 'Deepset',
 'openai data protection',
 'Azure',
 'Caden',
 'AI',
 'shopping',
 'samsung next',
 'Khosla Ventures',
 'BetterData',
 'dating apps',
 'european union',
 'moderation',
 'anthropic',
 'Elicit',
 'Softbank',
 'steaming',
 'quantexa',
 'meta earnings',
 'voice assistant',
 'software development',
 'sergey gribov',
 'eu us ttc joint statement',
 'edsoma',
 'webflow',
 'allen institute for ai',
 'synthetic voice',
 'Gradient Ventures',
 'mixed reality',
 'Robust.AI',
 'Pinterest',
 'iRobot',
 'google deepmind',
 'IBM',
 'Nova',
 'Atreides Management',
 'Better.com',
 'eu digital services act',
 'opioid epidemic',

In [209]:
len(set(flat))

1589

Because the tags are quite niche and there are 1589 unique tags, many of which do not correspond to any industry, it is difficult to derive an industry for a given article from the tags alone.

Instead, I use a keyword approach that classifies each article's summary text as related to one of the industries.

In [211]:
industry_keywords = {
    'Healthcare': ['health', 'medical', 'clinic', 'disease', 'patient', 'surgery', 'therapy', 'pharmaceutical', 'diagnos', 'treatment'],
    'Education': ['school', 'education', 'student', 'university', 'college', 'curriculum', 'academic', 'scholar', 'learning', 'teach', 'classroom'],
    'Finance': ['finance', 'banking', 'stock', 'investment', 'crypto', 'currency', 'financial', 'market', 'economy', 'trade', 'wealth'],
    'Media/Entertainment': ['media', 'entertainment', 'film', 'music', 'tv', 'television', 'stream', 'game', 'gaming', 'video', 'movie', 'podcast'],
    'Logistics and Transportation': ['logistics', 'transportation', 'shipping', 'freight', 'cargo', 'delivery', 'supply chain', 'truck', 'vehicle', 'flight'],
    'Technology': ['software', 'hardware', 'cloud', 'computing', 'data', 'network', 'SaaS', 'internet', 'cyber', 'tech', 'digital', 'gadget'],
    'Other': []  # This will be used as a fallback category
}

# The default category for articles that don't match any keyword
default_category = 'Other'


In [212]:
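def classify_article_by_keywords(article):
    # Return the first industry (in dictionary order) with any keyword
    # appearing as a case-insensitive substring of the article text
    text = article.lower()
    for industry, keywords in industry_keywords.items():
        if any(keyword.lower() in text for keyword in keywords):
            return industry
    return default_category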

article_industries = [classify_article_by_keywords(article) for article in article_contents]

In [214]:
df['Industry'] = article_industries
industry_counts = df['Industry'].value_counts()
print(industry_counts)

Other                           538
Technology                      218
Media/Entertainment              97
Finance                          63
Education                        58
Healthcare                       23
Logistics and Transportation      8
Name: Industry, dtype: int64


Using the keywords, I was able to classify 467 of the 1005 articles (about 46%); the remaining 538 fall into Other. If I have time, I plan to revisit my industry tagging, using more keywords or a more sophisticated classification technique, so that the labels are more informative.
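
One possible refinement of the keyword tagger is sketched here. Rather than returning the first industry with any substring hit, it counts matches that begin at the start of a word and picks the industry with the highest count, which avoids hits inside unrelated words (for example `stream` inside `mainstream`) and reduces the dependence on dictionary order. The function name `classify_by_keyword_score`, the shortened keyword lists, and the sample sentences are illustrative assumptions, not part of the analysis above.

```python
import re
from collections import Counter

# Shortened, illustrative keyword lists (assumption: same shape as industry_keywords)
example_keywords = {
    "Healthcare": ["health", "medical", "patient", "diagnos"],
    "Finance": ["finance", "banking", "investment", "market"],
    "Media/Entertainment": ["media", "film", "music", "stream", "video"],
}

def classify_by_keyword_score(text, keywords_by_industry, default="Other"):
    """Return the industry whose keywords match the most word prefixes in text."""
    lowered = text.lower()
    scores = Counter()
    for industry, keywords in keywords_by_industry.items():
        for keyword in keywords:
            # \b anchors the keyword at the start of a word, so "diagnos"
            # matches "diagnostics" but "stream" does not match "mainstream".
            pattern = r"\b" + re.escape(keyword.lower())
            scores[industry] += len(re.findall(pattern, lowered))
    if not scores or max(scores.values()) == 0:
        return default
    # Ties resolve to the industry listed first in the dictionary
    return max(keywords_by_industry, key=lambda name: scores[name])

samples = [
    "The startup sells diagnostics software that flags patient risk for hospitals.",
    "AI assistants are going mainstream this year.",
]
labels = [classify_by_keyword_score(s, example_keywords) for s in samples]
```

Whether scoring improves on first-match tagging for this corpus would need to be checked against a hand-labeled sample of articles.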

## II. Descriptive Analysis
Next, I go through the key variables that I will use throughout my methods section and display some summary statistics to start my analysis.

## III. Method (Text Analysis)
I then apply sentiment analysis to the TechCrunch articles within each industry, categorizing each article's sentiment toward AI in that industry as positive, neutral, or negative.
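
As a rough illustration of the three-way labeling, the sketch below scores a summary by counting words from two small word lists. The word lists, the function name `label_sentiment`, and the example summaries are all assumptions made up for this sketch; a real analysis would more likely rely on a validated lexicon or a pretrained model (NLTK's VADER analyzer is one option).

```python
import re

# Toy word lists for illustration only (assumption: not a validated lexicon)
POSITIVE_WORDS = {"breakthrough", "improve", "improves", "promising", "success", "gain"}
NEGATIVE_WORDS = {"risk", "layoffs", "lawsuit", "fail", "fails", "concern", "concerns"}

def label_sentiment(text):
    """Label text positive, negative, or neutral from simple word counts."""
    words = re.findall(r"[a-z']+", text.lower())
    positives = sum(word in POSITIVE_WORDS for word in words)
    negatives = sum(word in NEGATIVE_WORDS for word in words)
    score = positives - negatives
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

# Made-up summaries, not taken from the scraped data
example_summaries = [
    "The new model is a promising breakthrough for radiology.",
    "Regulators raise concerns about lawsuit risk.",
    "The company announced a product update on Tuesday.",
]
sentiments = [label_sentiment(summary) for summary in example_summaries]
```

Applied to the dataframe, such labels could then be tabulated per industry to compare the share of positive, neutral, and negative articles.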

Next, I use my dataframe to track the number of AI mentions over time as a measure of the intensity of AI innovation within an industry.
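
A minimal sketch of the counting step follows. It assumes every article URL begins with `/YYYY/MM/DD/`, as the URLs printed earlier do, and reads the month from the URL rather than from the formatted Date string. The rows and the name `monthly_counts` are made up for illustration.

```python
import re
from collections import Counter

# Illustrative rows shaped like (URL, Industry); the values are made up
example_rows = [
    ("/2024/04/06/example-article-a/", "Technology"),
    ("/2024/04/05/example-article-b/", "Technology"),
    ("/2024/03/28/example-article-c/", "Healthcare"),
    ("/2024/03/02/example-article-d/", "Technology"),
]

URL_DATE = re.compile(r"^/(\d{4})/(\d{2})/\d{2}/")

def monthly_counts(rows):
    """Count articles per (year-month, industry) using the date embedded in each URL."""
    counts = Counter()
    for url, industry in rows:
        match = URL_DATE.match(url)
        if match is None:
            continue  # skip URLs that do not start with /YYYY/MM/DD/
        month = f"{match.group(1)}-{match.group(2)}"
        counts[(month, industry)] += 1
    return counts

counts = monthly_counts(example_rows)
for (month, industry), n in sorted(counts.items()):
    print(month, industry, n)
```

In the notebook, the same counts could be produced with pandas by extracting the month from `df['URL']` and grouping on it together with `df['Industry']`.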

I then use time series analysis to track how AI coverage evolves.

Finally, I fit an Autoregressive Integrated Moving Average (ARIMA) model to forecast how much AI will be discussed over the next year.
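
To make the forecasting idea concrete, the sketch below implements a deliberately simplified stand-in: an ARIMA(1,1,0)-style model that differences the series once, fits an AR(1) with intercept to the differences by ordinary least squares, and integrates the forecasts back to levels. It is not a full ARIMA estimator (there is no moving-average term and no order selection); a library fit such as statsmodels' `ARIMA` class would be the more likely choice in practice. The function names and the monthly counts are assumptions for illustration.

```python
def fit_ar1(values):
    """OLS fit of v[t] = c + phi * v[t-1]; returns (c, phi)."""
    x, y = values[:-1], values[1:]
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    var_x = sum((xi - mean_x) ** 2 for xi in x)
    if var_x == 0:
        # Constant input: fall back to a pure drift model
        return mean_y, 0.0
    cov_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    phi = cov_xy / var_x
    return mean_y - phi * mean_x, phi

def forecast_arima_110(series, steps):
    """Forecast `steps` periods ahead with a least-squares ARIMA(1,1,0)-style model."""
    if len(series) < 3:
        raise ValueError("need at least 3 observations")
    # First differences remove a linear trend in the level
    diffs = [b - a for a, b in zip(series, series[1:])]
    c, phi = fit_ar1(diffs)
    level, last_diff = series[-1], diffs[-1]
    forecasts = []
    for _ in range(steps):
        last_diff = c + phi * last_diff  # predict the next change
        level += last_diff               # integrate back to the level
        forecasts.append(level)
    return forecasts

# Made-up monthly article counts, for illustration only
example_counts = [40, 44, 47, 53, 56, 62, 65, 71, 74, 80, 83, 89]
next_year = forecast_arima_110(example_counts, steps=12)
```

With roughly a year of scraped articles, a twelve-month forecast rests on very few observations, so any such projection should be read as indicative only.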