# Analyzing a pseudo-news network's homepages

Metric Media claims its network of websites was "established to fill the void in community news". But questionable practices, including [allegations](https://www.nytimes.com/2020/10/18/technology/timpone-local-news-metric-media.html) of a pay-for-play operation, have seen the network [come](https://www.cjr.org/tow_center_reports/hundreds-of-pink-slime-local-news-outlets-are-distributing-algorithmic-stories-conservative-talking-points.php) [under](https://www.cjr.org/analysis/as-election-looms-a-network-of-mysterious-pink-slime-local-news-outlets-nearly-triples-in-size.php) [considerable](https://www.cjr.org/tow_center_reports/metric-media-lobbyists-funding.php) [scrutiny](https://www.cjr.org/tow_center_reports/community-newsmaker-metric-media-local-news.php).

The original purpose of this project was to experiment with SpaCy's named entity recognition engine, extracting the names of public figures mentioned in stories on the homepages of the 156 sites in Metric Media's Arizona, Florida, Georgia, Ohio and West Virginia local 'news' networks.

However, in the process of analyzing the data that resulted from this exercise, my attention was drawn to a curious practice that has not been reported on to date: Many of these websites have started publishing politicians' tweets verbatim and presenting them as news articles.

For example, the name "Mike Turner" was found in ten headlines. On closer inspection, all had the same format:
>Ohio U.S. Rep Mike Turner: \[Tweet\]
>
> e.g. [Ohio U.S. Rep Mike Turner: "This is not just affordable housing, this is respectable housing. Dayton's senior residents deserve..."](https://greenecotimes.com/stories/629896011-ohio-u-s-rep-mike-turner-this-is-not-just-affordable-housing-this-is-respectable-housing-dayton-s-senior-residents-deserve)

Consequently, I sought to dig deeper into the extent of this practice. Starting from the 61 politicians whose tweets appeared as headlines in my original dataset, I followed the "Organizations in this article" tags to reach and then scrape each site's archive of the other articles that had been created from tweets.

This resulted in a dataset of 12,036 articles.

### Step 1: Scrape articles from each homepage in AZ, FL, GA, OH and WV networks

I began by scraping the homepages of each site in the five states' networks.

This produced a dataset of 6,722 articles.

(Full disclosure: I built this scraper out one section at a time, so there is a lot of needless repetition that I didn't get around to making leaner!)

In [32]:
from bs4 import BeautifulSoup
import pandas as pd
import requests
from urllib.parse import urljoin

headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.5060.134 Safari/537.36',
    }

# Begin with one website for each state
# URLS for other sites in the state's network are then scraped from these sites
sites = [
    {
        'site_url': 'https://eastarizonanews.com',
        'state': 'Arizona'
    },
    {
        'site_url': 'https://brevardsun.com',
        'state': 'Florida'
    },
    {
        'site_url': 'https://atlstandard.com',
        'state': 'Georgia'
    },
    {
        'site_url': 'https://cantonreporter.com',
        'state': 'Ohio'
    },
    {
        'site_url': 'https://alleghenyhighlandstoday.com',
        'state': 'West Virginia'
    }   
]

articles = []

for site in sites:
    url = site['site_url']
    state = site['state']

    r = requests.get(url, headers=headers)    

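    # Name the parser explicitly so results do not depend on which parsers are installed
    doc = BeautifulSoup(r.text, 'html.parser')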

    all_sites = doc.select('.art-dropdown a')

    site_urls = []
    # Add the initial site to the list, as it is not included in the dropdown list of other sites in the state network
    site_urls.append(url)

    # Use a distinct loop variable so the outer `site` dict is not shadowed
    for link in all_sites:
        site_urls.append(link['href'])

    for site_url in site_urls:

        # Reuse the `headers` dict defined at the top of the cell

        r = requests.get(site_url, headers=headers)

        doc = BeautifulSoup(r.text, 'html.parser')

        site_name = doc.select_one('.story-title h1 a').text

        # Get FEATURED SECTION

        main_section = doc.select_one('.main__part')

        section_title = 'Featured'

        # Get top story

        top_story = main_section.select_one('.first-featured-title')

        headline = top_story.select_one('h2.card-title').text.strip()

        url = top_story.select_one('h2.card-title a')['href']
        url = urljoin(site_url, url)

        preview_text = top_story.select_one('p.card-text').text.strip()

        subsection = top_story.select_one('p.news-preview__label').text.strip()

        subsection_url = top_story.select_one('p.news-preview__label a')
        if subsection_url:
            subsection_url = subsection_url['href']
            subsection_url = urljoin(site_url, subsection_url)

        author = top_story.select_one('.card-author a')
        if author:
            author = author.text.strip()

        author_url = top_story.select_one('.card-author a')
        if author_url:
            author_url = author_url['href']
            author_url = urljoin(site_url, author_url)

        article = {
            'site': site_name,
            'site_url': site_url,
            'section': section_title,
            'headline': headline,
            'url': url,
            'subsection': subsection,
            'subsection_url': subsection_url,
            'author': author,
            'author_url': author_url,
            'article_status': 'top_story',
            'preview_text': preview_text,
            'state': state
        }

        articles.append(article)

        ## Get second story

        second_story = main_section.select_one('.second-featured-title')

        if second_story:

            headline = second_story.select_one('h2.card-title').text.strip()

            url = second_story.select_one('h2.card-title a')['href']
            url = urljoin(site_url, url)

            preview_text = second_story.select_one('p.card-text').text.strip()

            subsection = second_story.select_one('p.news-preview__label')
            if subsection:
                subsection = subsection.text.strip()

            subsection_url = second_story.select_one('p.news-preview__label a')
            if subsection_url:
                subsection_url = subsection_url['href']
                subsection_url = urljoin(site_url, subsection_url)

            # Guard against a missing byline, as the top-story block does
            author = second_story.select_one('.card-author a')
            if author:
                author_url = urljoin(site_url, author['href'])
                author = author.text.strip()
            else:
                author_url = None

            article = {
                'site': site_name,
                'site_url': site_url,
                'section': section_title,
                'headline': headline,
                'url': url,
                'subsection': subsection,
                'subsection_url': subsection_url,
                'author': author,
                'author_url': author_url,
                'article_status': 'second_featured_story',
                'preview_text': preview_text,
                'state': state
            }

            articles.append(article)

        ## Get other featured stories

        top_stories = main_section.select('div.news-preview')

        for story in top_stories:
            if 'first-featured-title' in story['class']:
                pass
            elif 'second-featured-title' in story['class']:
                pass
            else:
                headline = story.select_one('h2.card-title').text.strip()

                url = story.select_one('h2.card-title a')['href']
                url = urljoin(site_url, url)

                preview_text = story.select_one('p.card-text').text.strip()

                subsection = story.select_one('p.news-preview__label').text.strip()

                subsection_url = story.select_one('p.news-preview__label a')
                if subsection_url:
                    subsection_url = subsection_url['href']
                    subsection_url = urljoin(site_url, subsection_url)

                article = {
                    'site': site_name,
                    'site_url': site_url,
                    'section': section_title,
                    'headline': headline,
                    'url': url,
                    'subsection': subsection,
                    'subsection_url': subsection_url,
                    'author': '',
                    'author_url': '',
                    'article_status': '',
                    'preview_text': preview_text,
                    'state': state
                }

                articles.append(article)

        ## Get rest of featured stories

        stories = main_section.select('li a')

        for story in stories:
            headline = story.text
            url = story['href']
            # urljoin handles both relative and absolute hrefs, unlike plain concatenation
            url = urljoin(site_url, url)

            article = {
                'site': site_name,
                'site_url': site_url,
                'section': section_title,
                'headline': headline,
                'url': url,
                'subsection': '',
                'subsection_url': '',
                'author': '',
                'author_url': '',
                'article_status': '',
                'preview_text': '',
                'state': state
            }

            articles.append(article)

        # Get SIDEBAR SECTIONS

        sidebar = doc.select_one('.sidebar')

        sidebar_sections = sidebar.select('.sidebar__part')

        for section in sidebar_sections:
            section_name = section.select_one('.section-title')
            if section_name:
                section_title = section_name.text.strip()

                section_stories = section.select('li')

                for story in section_stories:
                    headline = story.select_one('a')['title']
                    url = story.select_one('a')['href']
                    url = urljoin(site_url, url)
                    author = story.select_one('.card-author a')
                    if author:
                        author_url = author['href']
                        author_url = urljoin(site_url, author_url)
                        author = author.text.strip()
                    else:
                        author_url = ''
                        author = ''

                    article = {
                        'site': site_name,
                        'section': section_title,
                        'site_url': site_url,
                        'headline': headline,
                        'url': url,
                        'subsection': '',
                        'subsection_url': '',
                        'author': author,
                        'author_url': author_url,
                        'article_status': '',
                        'preview_text': '',
                        'state': state
                    }

                    articles.append(article)

        # Get DATA POINTS

        data_points_section = doc.select_one('section.data-points')

        section_title = data_points_section.select_one('h2.section-title').text.strip()

        data_point_stories = data_points_section.select('li')

        for story in data_point_stories:
            headline = story.select_one('h4.title a').text.strip()

            url = story.select_one('h4.title a')['href']
            url = urljoin(site_url, url)

            subsection = story.select_one('p.news-preview__label')
            if subsection:
                subsection = subsection.text.strip()
            else:
                subsection = ''

            subsection_url = story.select_one('p.news-preview__label a')
            if subsection_url:
                subsection_url = subsection_url['href']
                subsection_url = urljoin(site_url, subsection_url)
            else:
                subsection_url = ''            

            article = {
                'site': site_name,
                'site_url': site_url,
                'section': section_title,
                'headline': headline,
                'url': url,
                'subsection': subsection,
                'subsection_url': subsection_url,
                'author': '',
                'author_url': '',
                'article_status': '',
                'preview_text': '',
                'state': state
            }

            articles.append(article)

        # Get LATEST NEWS

        latest_news = doc.select_one('section .main__part')

        section_title = latest_news.select_one('.section-title').text.strip()

        latest_news_stories = latest_news.select('.card-body')

        for story in latest_news_stories:
            subsection = story.select_one('.news-preview__label')
            if subsection:
                subsection = subsection.text.strip()

            headline = story.select_one('.card-title a').text.strip()

            url = story.select_one('.card-title a')['href']
            url = urljoin(site_url, url)

            preview_text = story.select_one('.card-text').text.strip()

            article = {
                'site': site_name,
                'site_url': site_url,
                'section': section_title,
                'headline': headline,
                'url': url,
                'subsection': subsection,
                'subsection_url': '',
                'author': '',
                'author_url': '',
                'article_status': '',
                'preview_text': preview_text,
                'state': state
            }

            articles.append(article)

        # Get 'NEWS' sections

        news_sections = doc.select('div.home-tag-news')
        section_title = 'News'

        for news_section in news_sections:
            subsection = news_section.select_one('h4').text.strip()

            stories = news_section.select('.news-preview li a')
            for story in stories:
                headline = story.text.strip()

                url = urljoin(site_url, story['href'])

                article = {
                    'site': site_name,
                    'site_url': site_url,
                    'section': section_title,
                    'headline': headline,
                    'url': url,
                    'subsection': subsection,
                    'subsection_url': '',
                    'author': '',
                    'author_url': '',
                    'article_status': '',
                    'preview_text': '',
                    'state': state
                }

                articles.append(article)

df = pd.DataFrame(articles)
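
Much of the repetition acknowledged above comes from rebuilding the same twelve-key dictionary in every section. One possible way to trim it, sketched here with a hypothetical helper named `make_article` (it is not part of the original scraper), is to fill in blank defaults once and pass only the fields a given section actually provides. Note one assumption: this sketch normalizes a missing value to `''`, whereas the scraper above sometimes leaves `None`.

```python
# Hypothetical helper, not part of the original notebook: builds one article
# record with the same twelve keys the scraper uses, defaulting to ''.
ARTICLE_FIELDS = [
    'site', 'site_url', 'section', 'headline', 'url', 'subsection',
    'subsection_url', 'author', 'author_url', 'article_status',
    'preview_text', 'state',
]


def make_article(**fields):
    unknown = set(fields) - set(ARTICLE_FIELDS)
    if unknown:
        raise KeyError(f'Unexpected article fields: {sorted(unknown)}')
    # Treat None (a selector that matched nothing) the same as "not provided"
    return {key: fields.get(key) or '' for key in ARTICLE_FIELDS}


example = make_article(
    site='Greene County Times',
    site_url='https://greenecotimes.com',
    section='Featured',
    headline='Example headline',                      # placeholder value
    url='https://greenecotimes.com/stories/example',  # placeholder URL
    state='Ohio',
)
print(example)
```

Each section of the scraper could then end with `articles.append(make_article(...))`, leaving only the selectors that differ between sections.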

### Step 2: Use SpaCy to identify people named in headlines

Having scraped all of the homepages, I used SpaCy to extract names of people mentioned in headlines.

This was imperfect, but sufficient.

The results were added to a list in a new column named `people_mentioned`.

In [33]:
import spacy

nlp = spacy.load("en_core_web_lg")

def find_people(text):
    people = []
    for ent in nlp(text).ents:
        if ent.label_ == 'PERSON':
            person = ent.text
            people.append(person)
    return people

df['people_mentioned'] = df.headline.apply(find_people)

df.to_csv('articles.csv', index=False)
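
Because the entity recognition is imperfect, a light post-filter can remove some obvious false positives before counting names. The sketch below is an assumption on my part rather than something the notebook does: it uses a hypothetical function, `looks_like_person`, that drops entities written entirely in capitals, such as the `CHRISTIAN ACADEMY` string that SpaCy tags as a person in the results further down.

```python
# Hypothetical post-filter (not in the original notebook): a crude heuristic
# that discards all-caps strings, which are more often institutions than people.
def looks_like_person(name):
    stripped = name.strip()
    if not stripped:
        return False
    if stripped.isupper():
        return False
    return any(ch.isalpha() for ch in stripped)


candidates = ['Biden', 'CHRISTIAN ACADEMY', 'Mike Turner', '  ']
kept = [name for name in candidates if looks_like_person(name)]
print(kept)
```

A rule this blunt would also discard a genuine name printed in capitals, so it is only a starting point.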

Per the [documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.explode.html), Pandas' `explode` function will "Transform each element of a list-like to a row, replicating index values." Applied to `people_mentioned`, it yields one row per person per headline.

In [34]:
df_with_people = df.explode('people_mentioned').drop_duplicates()

df_with_people.to_csv('articles_with_people.csv', index=False)
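
To make the effect of `explode` concrete, here is a small sketch on a made-up three-row dataframe (the headlines and names are placeholders, not data from the scrape). It shows that index values are repeated for headlines with several names, and that an empty list becomes a single `NaN` row, which `value_counts` ignores by default.

```python
import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame({
    'headline': ['A meets B', 'No names here', 'A again'],
    'people_mentioned': [['A', 'B'], [], ['A']],
})

exploded = toy.explode('people_mentioned')
print(exploded)

# NaN rows (headlines with no names) are excluded from the counts
counts = exploded.people_mentioned.value_counts()
print(counts)
```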

Which names did SpaCy find most commonly?

In [109]:
df_with_people.people_mentioned.value_counts().head(10)

Biden                26
Trump                21
Warren               11
Ducey                11
Giroux               10
Mike Turner          10
Ashlyn Coleman        9
Jim Justice           9
Octavian Dungee       9
CHRISTIAN ACADEMY     8
Name: people_mentioned, dtype: int64

Most of those aren't that interesting, but Giroux looks worth investigating...

In [144]:
df_with_people[(df_with_people.people_mentioned == 'Giroux')].site.value_counts()

Cincy Reporter    10
Name: site, dtype: int64

All 10 of the mentions of Jenn Giroux appear on the same website, the Cincy Reporter. Does that seem unusual?

In [108]:
df_with_people[df_with_people.site == 'Cincy Reporter'].people_mentioned.value_counts().head(10)

Giroux                10
Maxwell Berghausen     2
Jack Pollock           2
Octavian Dungee        2
Ashlyn Coleman         2
Trump                  1
Veronica Drake         1
Weston Caruso          1
Joe Leet               1
Name: people_mentioned, dtype: int64

Mike Turner appeared ten times too. Is anything of note going on there?

In [147]:
df_with_people[(df_with_people.people_mentioned == 'Mike Turner')].head(3)

| | site | site_url | section | headline | url | subsection | subsection_url | author | author_url | article_status | preview_text | state | people_mentioned |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4766 | Greene County Times | https://greenecotimes.com | Data points | Ohio U.S. Rep Mike Turner: "In the minimal tim... | https://greenecotimes.com/stories/629020677-oh... | Local Government | https://greenecotimes.com/stories/tag/8-local-... | | | | | Ohio | Mike Turner |
| 4767 | Greene County Times | https://greenecotimes.com | Data points | Ohio U.S. Rep Mike Turner: "Way to go @univofd... | https://greenecotimes.com/stories/629020706-oh... | Local Government | https://greenecotimes.com/stories/tag/8-local-... | | | | | Ohio | Mike Turner |
| 4768 | Greene County Times | https://greenecotimes.com | Data points | Ohio U.S. Rep Mike Turner: "ICYMI: The FY23 ND... | https://greenecotimes.com/stories/628974198-oh... | Local Government | https://greenecotimes.com/stories/tag/8-local-... | | | | | Ohio | Mike Turner |


All of those headlines appear to follow the same pattern...

In [149]:
df_with_people[(df_with_people.people_mentioned == 'Mike Turner')].headline

4766    Ohio U.S. Rep Mike Turner: "In the minimal tim...
4767    Ohio U.S. Rep Mike Turner: "Way to go @univofd...
4768    Ohio U.S. Rep Mike Turner: "ICYMI: The FY23 ND...
4769    Ohio U.S. Rep Mike Turner: "Americans are expe...
4770    Ohio U.S. Rep Mike Turner: "One year later and...
4771    Ohio U.S. Rep Mike Turner: "The admission of #...
4772    Ohio U.S. Rep Mike Turner: "We think Texas is ...
4783    Ohio U.S. Rep Mike Turner: "In the minimal tim...
4784    Ohio U.S. Rep Mike Turner: "Way to go @univofd...
4785    Ohio U.S. Rep Mike Turner: "ICYMI: The FY23 ND...
Name: headline, dtype: object

### Step 3: Extract articles derived from politicians' tweets

Having noticed this curious habit of publishing tweets as news articles, I wanted to see how widespread this practice was within my sample.

I created a new dataframe, `tw`, that filtered for headlines that followed the pattern identified above:


> \[State\] U.S. Rep \[Name\]: \[Tweet content\]

I then added a new column, `tw.politician`, which keeps the part of each headline before the first colon (for example, `Ohio U.S. Rep Mike Turner`) and so identifies the politician whose tweets were being published.

In [35]:
# Escape the dots so they match literal periods rather than any character
tw = df[df.headline.str.contains(r"^.* U\.S\. Rep .*:.*")].copy()

tw['politician'] = tw.headline.str.split(':', n=1).str[0]
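
As an alternative to splitting on the colon, the headline pattern could be parsed into its separate parts with named groups. This is a sketch under the assumption that every headline follows the `[State] U.S. Rep [Name]: [Tweet content]` format shown above; `HEADLINE_PATTERN` and `parse_tweet_headline` are hypothetical names introduced for illustration.

```python
import re

# Hypothetical pattern: lazy state match so multi-word states still work
HEADLINE_PATTERN = re.compile(
    r'^(?P<state>.+?) U\.S\. Rep (?P<name>[^:]+):\s*(?P<tweet>.*)$'
)


def parse_tweet_headline(headline):
    """Return a dict with state, name and tweet, or None if the format differs."""
    match = HEADLINE_PATTERN.match(headline)
    return match.groupdict() if match else None


ohio = parse_tweet_headline('Ohio U.S. Rep Mike Turner: "Way to go @univofd...')
wv = parse_tweet_headline('West Virginia U.S. Rep Carol Miller: "ICYMI: I...')
other = parse_tweet_headline('A headline that is not a tweet')
print(ohio, wv, other)
```

With pandas, the same pattern could be passed to `Series.str.extract` to produce `state`, `name` and `tweet` columns in one step.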

I can immediately see that 58 of the sites in my sample published one or more of these articles.

In [93]:
tw.site.drop_duplicates().size

58

### Step 4: Scrape full archive of articles published about politicians whose tweets were found on homepages

Having established that these Twitter-based 'news' articles were present on 58 sites on August 7, I wanted to dig into how well established this practice was.

All of these articles have a sidebar titled "Organizations in this article". This links to the paginated archive of stories the website has published about the politician in question.

In [None]:
# We only need one example per politician per website
# This gives us the index from which we can filter the df
tw_index = tw[['site_url', 'politician']].drop_duplicates().index

tw_urls = tw.filter(items = tw_index, axis=0).url.tolist()
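
The index-then-filter step above can also be written as a single call, because `drop_duplicates` accepts a `subset` argument and keeps the first row of each group by default. The sketch below checks on a made-up dataframe (placeholder URLs `u1` to `u4`) that both routes return the same list.

```python
import pandas as pd

# Toy data for illustration only; the URLs are placeholders
toy_tw = pd.DataFrame({
    'site_url': ['https://greenecotimes.com'] * 3 + ['https://eastarizonanews.com'],
    'politician': ['Ohio U.S. Rep Mike Turner'] * 3 + ["Arizona U.S. Rep Tom O'Halleran"],
    'url': ['u1', 'u2', 'u3', 'u4'],
})

# Route 1: collect the index of unique pairs, then filter (as in the cell above)
idx = toy_tw[['site_url', 'politician']].drop_duplicates().index
via_index = toy_tw.filter(items=idx, axis=0).url.tolist()

# Route 2: deduplicate on the two columns directly
direct = toy_tw.drop_duplicates(subset=['site_url', 'politician']).url.tolist()

print(via_index, direct)
```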

In [None]:
# Go to each example page and scrape the politician's name and the link to the index of stories mentioning them
# This data is saved to a list named tw_pols

tw_pols = []

for url in tw_urls:
    headers = {
            'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.5060.134 Safari/537.36',
        }

    r = requests.get(url, headers=headers)

    doc = BeautifulSoup(r.text, 'html.parser')

    politician = doc.select_one('.fz-14 a.link-red')

    pol_url = politician['href']
    pol_url = requests.compat.urljoin(url, pol_url)
    pol_url = pol_url + '?page='

    pol_name = politician.text
    
    tw_pol = {
        'politician': pol_name,
        'politician_url': pol_url
    }
    
    tw_pols.append(tw_pol)   

Having got the URLs of the paginated archives, I can now iterate through each page of each politician's archive on each website.

The resulting articles are saved to `pol_tw_stories`.

In [39]:
pol_tw_stories = []

for pol in tw_pols:
    
    pol_name = pol['politician']
    pol_url = pol['politician_url']
    
    r = requests.get(pol_url, headers=headers)
    doc = BeautifulSoup(r.text, 'html.parser')
    
    last_page = doc.select_one('li.last a')
    # Safety net in case there is only one page and therefore no 'Last Page'
    if last_page:
        last_page = last_page['href']
        last_page = int(last_page.split('=')[1])
    else:
        last_page = 1

    page_no = 1

    while page_no <= last_page:
        
        url = f'{pol_url}{page_no}'
        print(url)

        headers = {
            'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.5060.134 Safari/537.36',
        }

        r = requests.get(url, headers=headers)

        doc = BeautifulSoup(r.text, 'html.parser')

        stories = doc.select('.content li .card-body')

        for story in stories:
            link = story.select_one('h2 a')

            headline = link.text
            story_url = link['href']
            story_url = requests.compat.urljoin(pol_url, story_url)

            preview = story.select_one('p.card-text').text
            author = story.select_one('.card-author').text

            article = {
                'politician': pol_name,
                'headline': headline,
                'url': story_url.strip(),
                'preview_text': preview.strip(),
                'author': author.strip()
            }

            pol_tw_stories.append(article)

        page_no += 1

https://eastarizonanews.com/organizations/647622906-tom-o-halleran/stories?page=1
https://eastarizonanews.com/organizations/647622906-tom-o-halleran/stories?page=2
https://eastarizonanews.com/organizations/647622906-tom-o-halleran/stories?page=3
https://eastarizonanews.com/organizations/647622906-tom-o-halleran/stories?page=4
https://eastarizonanews.com/organizations/647622906-tom-o-halleran/stories?page=5
https://eastarizonanews.com/organizations/647622906-tom-o-halleran/stories?page=6
https://eastarizonanews.com/organizations/647622906-tom-o-halleran/stories?page=7
https://eastarizonanews.com/organizations/647622906-tom-o-halleran/stories?page=8
https://eastarizonanews.com/organizations/647622906-tom-o-halleran/stories?page=9
https://eastarizonanews.com/organizations/647622906-tom-o-halleran/stories?page=10
https://mohavetoday.com/organizations/647622678-paul-a-gosar/stories?page=1
https://mohavetoday.com/organizations/647622678-paul-a-gosar/stories?page=2
https://mohavetoday.com/org

https://emeraldcoasttimes.com/organizations/647622774-matt-gaetz/stories?page=12
https://emeraldcoasttimes.com/organizations/647622774-matt-gaetz/stories?page=13
https://emeraldcoasttimes.com/organizations/647622774-matt-gaetz/stories?page=14
https://emeraldcoasttimes.com/organizations/647622774-matt-gaetz/stories?page=15
https://hernandoreporter.com/organizations/647622604-daniel-webster/stories?page=1
https://hernandoreporter.com/organizations/647622604-daniel-webster/stories?page=2
https://hernandoreporter.com/organizations/647622604-daniel-webster/stories?page=3
https://hernandoreporter.com/organizations/647622604-daniel-webster/stories?page=4
https://keywestreporter.com/organizations/647622567-carlos-a-gimenez/stories?page=1
https://keywestreporter.com/organizations/647622567-carlos-a-gimenez/stories?page=2
https://keywestreporter.com/organizations/647622567-carlos-a-gimenez/stories?page=3
https://keywestreporter.com/organizations/647622567-carlos-a-gimenez/stories?page=4
https://

https://okeechobeetimes.com/organizations/647622926-w-gregory-steube/stories?page=2
https://okeechobeetimes.com/organizations/647622926-w-gregory-steube/stories?page=3
https://okeechobeetimes.com/organizations/647622926-w-gregory-steube/stories?page=4
https://okeechobeetimes.com/organizations/647622926-w-gregory-steube/stories?page=5
https://okeechobeetimes.com/organizations/647622926-w-gregory-steube/stories?page=6
https://okeechobeetimes.com/organizations/647622926-w-gregory-steube/stories?page=7
https://okeechobeetimes.com/organizations/647622926-w-gregory-steube/stories?page=8
https://okeechobeetimes.com/organizations/647622926-w-gregory-steube/stories?page=9
https://okeechobeetimes.com/organizations/647622926-w-gregory-steube/stories?page=10
https://okeechobeetimes.com/organizations/647622926-w-gregory-steube/stories?page=11
https://panamacityreporter.com/organizations/647622820-neal-p-dunn/stories?page=1
https://panamacityreporter.com/organizations/647622820-neal-p-dunn/stories?p

https://stpetestandard.com/organizations/647622575-charlie-crist/stories?page=11
https://tallahasseesun.com/organizations/647622495-al-lawson-jr/stories?page=1
https://tallahasseesun.com/organizations/647622495-al-lawson-jr/stories?page=2
https://tallahasseesun.com/organizations/647622495-al-lawson-jr/stories?page=3
https://tallahasseesun.com/organizations/647622495-al-lawson-jr/stories?page=4
https://tallahasseesun.com/organizations/647622495-al-lawson-jr/stories?page=5
https://tallahasseesun.com/organizations/647622495-al-lawson-jr/stories?page=6
https://treasurecoastsun.com/organizations/647622558-brian-j-mast/stories?page=1
https://treasurecoastsun.com/organizations/647622558-brian-j-mast/stories?page=2
https://treasurecoastsun.com/organizations/647622558-brian-j-mast/stories?page=3
https://treasurecoastsun.com/organizations/647622558-brian-j-mast/stories?page=4
https://treasurecoastsun.com/organizations/647622558-brian-j-mast/stories?page=5
https://treasurecoastsun.com/organizatio

https://ncgeorgianews.com/organizations/647622374-a-drew-ferguson/stories?page=4
https://ncgeorgianews.com/organizations/647622374-a-drew-ferguson/stories?page=5
https://nwatlantanews.com/organizations/647622527-barry-loudermilk/stories?page=1
https://nwatlantanews.com/organizations/647622527-barry-loudermilk/stories?page=2
https://nwatlantanews.com/organizations/647622527-barry-loudermilk/stories?page=3
https://nwatlantanews.com/organizations/647622527-barry-loudermilk/stories?page=4
https://nwatlantanews.com/organizations/647622527-barry-loudermilk/stories?page=5
https://nwatlantanews.com/organizations/647622527-barry-loudermilk/stories?page=6
https://nwatlantanews.com/organizations/647622527-barry-loudermilk/stories?page=7
https://northgwinnettnews.com/organizations/647622569-carolyn-bourdeaux/stories?page=1
https://northgwinnettnews.com/organizations/647622569-carolyn-bourdeaux/stories?page=2
https://northgwinnettnews.com/organizations/647622569-carolyn-bourdeaux/stories?page=3
htt

https://delcoreview.com/organizations/647622915-troy-balderson/stories?page=5
https://delcoreview.com/organizations/647622915-troy-balderson/stories?page=6
https://greenecotimes.com/organizations/647622804-michael-r-turner/stories?page=1
https://greenecotimes.com/organizations/647622804-michael-r-turner/stories?page=2
https://greenecotimes.com/organizations/647622804-michael-r-turner/stories?page=3
https://greenecotimes.com/organizations/647622804-michael-r-turner/stories?page=4
https://greenecotimes.com/organizations/647622804-michael-r-turner/stories?page=5
https://greenecotimes.com/organizations/647622804-michael-r-turner/stories?page=6
https://greenecotimes.com/organizations/647622804-michael-r-turner/stories?page=7
https://greenecotimes.com/organizations/647622804-michael-r-turner/stories?page=8
https://greenecotimes.com/organizations/647622804-michael-r-turner/stories?page=9
https://lakecountytimes.com/organizations/647622862-shontel-brown/stories?page=1
https://lakecountytimes.c

https://eastpanhandletimes.com/organizations/647622618-david-j-trone/stories?page=1
https://eastpanhandletimes.com/organizations/647622618-david-j-trone/stories?page=2
https://eastpanhandletimes.com/organizations/647622618-david-j-trone/stories?page=3
https://eastpanhandletimes.com/organizations/647622618-david-j-trone/stories?page=4
https://eastpanhandletimes.com/organizations/647622618-david-j-trone/stories?page=5
https://eastpanhandletimes.com/organizations/647622618-david-j-trone/stories?page=6
https://eastpanhandletimes.com/organizations/647622618-david-j-trone/stories?page=7
https://eastpanhandletimes.com/organizations/647622618-david-j-trone/stories?page=8
https://eastpanhandletimes.com/organizations/647622618-david-j-trone/stories?page=9
https://eastpanhandletimes.com/organizations/647622618-david-j-trone/stories?page=10
https://eastpanhandletimes.com/organizations/647622618-david-j-trone/stories?page=11
https://eastpanhandletimes.com/organizations/647622618-david-j-trone/stori
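
The pagination logic in the cell above reads the last page number by splitting the "last page" link on `=`, which works as long as `page` is the only query parameter. A slightly more defensive sketch uses `urllib.parse` instead. The helper names `last_page_number` and `archive_page_urls` are hypothetical, and the relative form of the `href` is an assumption about the site's markup.

```python
from urllib.parse import parse_qs, urlsplit


def last_page_number(last_href):
    """Return the page number in a 'last page' href, or 1 if it is missing."""
    if not last_href:
        return 1
    query = parse_qs(urlsplit(last_href).query)
    try:
        return int(query['page'][0])
    except (KeyError, ValueError):
        return 1


def archive_page_urls(pol_url, last_href):
    """Build every archive page URL from a base ending in '?page='."""
    last_page = last_page_number(last_href)
    return [f'{pol_url}{page_no}' for page_no in range(1, last_page + 1)]


base = 'https://eastarizonanews.com/organizations/647622906-tom-o-halleran/stories?page='
pages = archive_page_urls(base, '/organizations/647622906-tom-o-halleran/stories?page=10')
print(len(pages), pages[0], pages[-1])
```

Building the list of URLs first also separates the page arithmetic from the requests, which makes the loop easier to test without touching the network.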

In [None]:
# Convert into a Pandas dataframe
tw_stories = pd.DataFrame(pol_tw_stories)

I forgot to include the site name and state in the `tw_stories` dataframe, so I need to fix this.

This is easily done because the byline for these articles is always "By [site name]". I therefore split the `author` column on "By ", create a dataframe of unique `site` and `state` combinations, then join on `site`.

In [133]:
tw_stories['site'] = tw_stories.author.str.split("By ").str[1]

sites = df[['site', 'state']].drop_duplicates()

tw_stories = tw_stories.merge(sites, on='site')
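
One thing worth checking with this approach is that `merge` defaults to an inner join, so any row whose byline does not yield a known site name would be dropped silently. The plain-Python sketch below illustrates the same byline-to-state lookup and flags rows that fail to match. The function `site_from_byline` is a hypothetical name, and the `Staff report` byline is invented purely to show the unmatched case.

```python
def site_from_byline(byline, prefix='By '):
    """Strip a leading 'By ' from a byline; return None if the prefix is absent."""
    byline = byline.strip()
    return byline[len(prefix):] if byline.startswith(prefix) else None


# Site-to-state pairs taken from the preview shown in this notebook
site_to_state = {
    'East Arizona News': 'Arizona',
    'Huntington Times': 'West Virginia',
}

# The last byline is made up to demonstrate a non-matching row
bylines = ['By East Arizona News', 'By Huntington Times', 'Staff report']

rows = []
for byline in bylines:
    site = site_from_byline(byline)
    rows.append({'author': byline, 'site': site, 'state': site_to_state.get(site)})

unmatched = [row['author'] for row in rows if row['state'] is None]
print(rows)
print(unmatched)
```

In pandas, a comparable check would be to compare `len(tw_stories)` before and after the merge.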

This process found 12,036 articles created from 61 politicians' tweets across 58 Metric Media sites.

Here's a little preview of the 12,036-row dataframe:

In [140]:
tw_stories

| | politician | headline | url | preview_text | author | site | state |
|---|---|---|---|---|---|---|---|
| 0 | Tom O'Halleran | Arizona U.S. Rep Tom O'Halleran: "Today is #Fa... | https://eastarizonanews.com/stories/629945258-... | Tom O'Halleran has 17,505 Twitter followers as... | By East Arizona News | East Arizona News | Arizona |
| 1 | Tom O'Halleran | Arizona U.S. Rep Tom O'Halleran: "Honored to a... | https://eastarizonanews.com/stories/629896216-... | Tom O'Halleran has 17,505 Twitter followers as... | By East Arizona News | East Arizona News | Arizona |
| 2 | Tom O'Halleran | Arizona U.S. Rep Tom O'Halleran: "REMINDER: I'... | https://eastarizonanews.com/stories/629896027-... | Tom O'Halleran has 17,505 Twitter followers as... | By East Arizona News | East Arizona News | Arizona |
| 3 | Tom O'Halleran | Arizona U.S. Rep Tom O'Halleran: "Last week, I... | https://eastarizonanews.com/stories/629865735-... | Tom O'Halleran has 17,505 Twitter followers as... | By East Arizona News | East Arizona News | Arizona |
| 4 | Tom O'Halleran | Arizona U.S. Rep Tom O'Halleran: "Last week, I... | https://eastarizonanews.com/stories/629832114-... | Tom O'Halleran has 17,505 Twitter followers as... | By East Arizona News | East Arizona News | Arizona |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 12031 | Carol D. Miller | West Virginia U.S. Rep Carol Miller: "ICYMI: I... | https://huntingtontimes.com/stories/624093828-... | Carol Miller has 11,026 Twitter followers as o... | By Huntington Times | Huntington Times | West Virginia |
| 12032 | Carol D. Miller | West Virginia U.S. Rep Carol Miller: "I spoke ... | https://huntingtontimes.com/stories/623186452-... | Carol Miller has 11,026 Twitter followers as o... | By Huntington Times | Huntington Times | West Virginia |
| 12033 | Carol D. Miller | West Virginia U.S. Rep Carol Miller: "Commonse... | https://huntingtontimes.com/stories/622801049-... | Carol Miller has 11,026 Twitter followers as o... | By Huntington Times | Huntington Times | West Virginia |
| 12034 | Carol D. Miller | West Virginia U.S. Rep Carol Miller: "Inflatio... | https://huntingtontimes.com/stories/622801080-... | Carol Miller has 11,026 Twitter followers as o... | By Huntington Times | Huntington Times | West Virginia |


Which of the sites in our sample had the most Twitter stories?

In [134]:
tw_stories.site.value_counts().head(10)

South Orlando News        933
NW Valley Times           604
Rome Reporter             568
SE Valley Times           538
Central Ohio Today        438
Lee Today                 382
Treasure Coast Sun        379
NC Florida News           371
ATL Standard              363
West Hillsborough News    328
Name: site, dtype: int64

Which politicians' tweets have been the subject of the most articles?

In [138]:
top_10_pols = tw_stories.politician.value_counts().head(10).index

tw_stories[tw_stories.politician.isin(top_10_pols)].groupby('politician').site.value_counts().sort_values(ascending=False)

politician             site              
Debbie Lesko           NW Valley Times       604
Majorie Taylor Greene  Rome Reporter         568
Val Butler Demings     South Orlando News    555
Andy Biggs             SE Valley Times       538
Jim Jordan             Central Ohio Today    438
Byron Donalds          Lee Today             382
Brian J. Mast          Treasure Coast Sun    379
Darren Soto            South Orlando News    378
Kat Cammack            NC Florida News       371
Nikema Williams        ATL Standard          363
Name: site, dtype: int64