# Scraping Wikipedia to prove ~~or disprove~~ A CURSE

In this notebook, I'm going to have a go at scraping a Wikipedia table, then cross-referencing it with other Wikipedia articles to get a richer dataset, and then doing some summaries to compare to national statistics.

The "Strictly Curse" comes up in discussion every winter: celebrity contestants on Strictly Come Dancing are perennially rumoured to be having affairs with their professional partners and then leaving their spouses. I thought it'd be interesting to see if, statistically, there's anything to it.

## Section 1: Get everything I can from the main table

en.wikipedia.org/wiki/List_of_Strictly_Come_Dancing_contestants has a list of everyone who's competed on Strictly since it started in 2004. I'm going to use that to get the main list of contestants, then have a look into their own wiki pages (later on) to pull out their marital status(es).

In [1]:
# Import the relevant packages

import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
import math

Getting the SCD UK data:

In [2]:
# Get the page, then soupify it.
page_url = "https://en.wikipedia.org/wiki/List_of_Strictly_Come_Dancing_contestants"
page = requests.get(page_url)
soup = BeautifulSoup(page.content,'html.parser')

# Find the right table
main_table = soup.find(class_="wikitable sortable")

# Get the contestants from that table
con_info = main_table.find_all('tr')

# Count the contestants
num = len(list(con_info))

# Set up empty arrays for the fields we need
#   (literally just the name and the series, so we can 
#    work out the year they competed.)
con_names=[]
con_series=[]

# Populate the info fields using the "td" tags
for i in range(1,num):
    con_names.append(con_info[i].find_all("td")[0].get_text().rstrip('\n'))
    con_series.append(con_info[i].find_all("td")[4].get_text().rstrip('\n'))
    
# Combine all those lists into a table
allinfo = {'Name':con_names,
          'Series':con_series}

# Put that table into a dataframe
df = pd.DataFrame(allinfo)

# Make a copy in case I need to go back and see what it originally was
df_Original = df.copy()

# Convert to the right datatypes
df = df.astype({'Series':'int32'})

# Add a "year of competition" column
# This only works for the UK!
df['Competing Year']= np.where(df['Series']==1,2004,df['Series']+2002)
df = df.drop('Series',axis=1)
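The year lookup leans on the fact that the first two UK series both aired in 2004, so series 1 is a special case and every later series is `series + 2002`. As a quick sanity check, here is the same rule as a plain function (the name `competing_year` is mine, not something used elsewhere in the notebook):

```python
def competing_year(series: int) -> int:
    """Map a UK Strictly series number to the year it aired."""
    # Series 1 and 2 both aired in 2004, hence the special case.
    return 2004 if series == 1 else series + 2002

# Print a few series numbers next to their computed years
for s in (1, 2, 3, 17):
    print(s, competing_year(s))
```

This only holds while the show keeps to one series per year, which is why the US data needs its own lookup table.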

Now getting some from the US! <br>
<br><em> Note: I included some from the US, then decided that I didn't need more data; I needed a better initial estimate of the H0 mean. So I'm going to leave this step out, for now.</em>

In [3]:
page = requests.get("https://en.wikipedia.org/wiki/List_of_Dancing_with_the_Stars_(American_TV_series)_competitors")
soup = BeautifulSoup(page.content,'html.parser')

table = soup.find(class_='wikitable sortable')
table_rows = table.find_all('tr')
num = len(table_rows)

celeb_names=[]
celeb_series=[]
for i in range(1,num):
    # Get the first entry in the row
    first_entry = table_rows[i].find('td').get_text()
    second_entry = table_rows[i].find_all('td')[1]
    try:
        # If it's a number, it's the series number
        series_num = pd.to_numeric(first_entry)
        celeb_series.append(series_num)
        celeb_names.append(second_entry.get_text().rstrip('\n'))
    except (ValueError, TypeError):
        # It might have a different series name
        if first_entry.replace('\n',"") == '15(All-Stars)':
            series_num = 15
            celeb_names.append(table_rows[i].find_all('td')[1].get_text().rstrip('\n'))
        else:
            if first_entry.replace('\n',"") == 'Juniors':
                series_num = 100
                celeb_names.append(table_rows[i].find_all('td')[1].get_text().rstrip('\n'))
            else:
                celeb_names.append(table_rows[i].find('td').get_text().rstrip('\n'))
        
        celeb_series.append(series_num)

# Then put it in a dataframe

US_info = {'Name':celeb_names, 'Series':celeb_series}

dfUS = pd.DataFrame(data=US_info)
dfUS=dfUS[dfUS['Series']!=100]

# Set up a dictionary for the years
series_years = {
        1:2005,2:2006,3:2006,4:2007,5:2007,
        6:2008,7:2008,8:2009,9:2009,10:2010,
        11:2010,12:2011,13:2011,14:2012,15:2012,
        16:2013,17:2013,18:2014,19:2014,20:2015,
        21:2015,22:2016,23:2016,24:2017,25:2017,
        26:2018,27:2018,28:2019}

# and a function for looking it up
def USYears(season):
    return series_years.get(season)

dfUS['Competing Year']=dfUS['Series'].apply(lambda x:series_years.get(x))
dfUS = dfUS[['Name','Competing Year']].astype({'Competing Year':'int32'})

(If I wanted to use the US data, I'd:) <em>Join the two datasets together!</em>

In [4]:
#listBoth=[df,dfUS]
#dfBoth = df.copy()

## Section 2: Going to individuals' wiki pages to pull cross-info from them 

For this, I'm just going to wiki/persons_name, and all the info (as far as I can see) is in the info box down the right hand side. It tends to be there for most of the competitors, so I think there's little enough leakage that I won't go looking any harder for it right now!

In [5]:
df2 = df.copy()
df2 = df2.reset_index(drop=True)
print(df2.shape)

(237, 2)


In [6]:
# A test function, to make sure everyone has a wiki page that I can get to
def HasAWikiPage(name):
    # First, get their wiki page up
    spaces = name.replace(" ","_")
    wiki_url = "https://en.wikipedia.org/wiki/" + spaces
    # print(wiki_url)
    page = requests.get(wiki_url)
    
    # If it can't find the page, don't do anything else
    if page.status_code != 200:
        return(0)
    
    return(1)

In [102]:
# A quick test shows that 99% of my contestants have at least a wiki page I can pull up - fab!

df_Test=pd.DataFrame(df2['Name'],columns=["Name"])
#df_Test['WikiPage']=df_Test['Name'].apply(lambda x:HasAWikiPage(x))
#df_Test.describe()

# Although, hilariously, Alex Jones (that's the only one I've spotted, there may be more) is pulling up
#   the far-right conspiracy theorist of the same name from the US. Not sure what to do about that one!
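One possible way to handle clashes like this is a small table of manual overrides that is consulted before the URL is built. The sketch below is only an illustration: `wiki_url` and `TITLE_OVERRIDES` are hypothetical names, and the disambiguated title in the table is my assumption and would need checking by hand.

```python
from urllib.parse import quote

# Hypothetical manual overrides for names that land on the wrong person.
# The disambiguated title here is an assumption and should be checked by hand.
TITLE_OVERRIDES = {"Alex Jones": "Alex Jones (Welsh presenter)"}

def wiki_url(name: str) -> str:
    """Build an English Wikipedia URL for a contestant's display name."""
    title = TITLE_OVERRIDES.get(name, name).strip().replace(" ", "_")
    # Percent-encode anything that isn't URL-safe (accents, apostrophes, ...)
    return "https://en.wikipedia.org/wiki/" + quote(title, safe="()_,")

print(wiki_url("Kate Silverton"))
print(wiki_url("Alex Jones"))
```

The override table stays tiny as long as clashes are rare, and it keeps the corrections visible in one place.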

In [29]:
def GetMarriageStatus_fromname(their_name):
    wiki_url = "https://en.wikipedia.org/wiki/" + their_name.replace(" ","_")
    page = requests.get(wiki_url)
    
    # If it can't find the page, don't do anything else
    if page.status_code != 200:
        return(0)
    
    # Find the "Spouse(s)" bit of their bio
    soup = BeautifulSoup(page.content,'html.parser')
    
    try:
        infobox = soup.find(class_="infobox biography vcard")
    except AttributeError:
        return(0)
    
    #print(infobox.prettify())
    
    try:
        infolines = infobox.find_all("tr")
    except AttributeError:
        return(0)
    
    # Walk the infobox rows and return the first "Spouse"/"Spouse(s)" row
    for row in infolines:
        rowtext = row.get_text()
        if "Spouse" in rowtext:
            # Drop the row label by prefix. str.lstrip() would treat
            # "Spouse(s)" as a set of characters and could also eat the
            # start of the spouse's name (e.g. a leading "S" or "e").
            for label in ("Spouse(s)", "Spouses", "Spouse"):
                if rowtext.startswith(label):
                    rowtext = rowtext[len(label):]
                    break
            return(rowtext)
    return(0)

def GetMarriageStatus_fromindex(index):
    # First, get their wiki page up
    their_name = df2.iloc[index,0]
    # print(their_name + ":\n") 
    return GetMarriageStatus_fromname(their_name)
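One thing to watch when trimming the row label: `str.lstrip` takes a set of characters rather than a prefix, so it can also eat the start of a name. A minimal sketch with a made-up row (the helper `strip_label` is hypothetical, not part of the notebook):

```python
def strip_label(row_text: str, labels=("Spouse(s)", "Spouses", "Spouse")) -> str:
    """Remove a leading infobox label, if present (hypothetical helper)."""
    for label in labels:
        if row_text.startswith(label):
            return row_text[len(label):]
    return row_text

row = "Spouse(s)Sue Example (m. 2001)"   # made-up row text
print(repr(row.lstrip("Spouse(s)")))     # character-set stripping
print(repr(strip_label(row)))            # prefix removal
```

The years survive either way, so the later date parsing is unaffected; it is only the spouse's name that can get clipped.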

In [30]:
# Put the spouses info in the new table
df2['Spouses String']=(df2.index.map(lambda x:GetMarriageStatus_fromindex(x)))

## Section 3: Processing the data I have so far, so I can answer the original question

In [31]:
# New dataframe with only the ones with marriage information
df3 = df2[df2['Spouses String']!=0]
df3 = df3.reset_index(drop=True)
df3.tail(5)

| | Name | Competing Year | Spouses String |
|---|---|---|---|
| 88 | Kate Silverton | 2018 | Mike Heron (m. 2010) |
| 89 | Catherine Tyldesley | 2019 | Tom Pitfield (m. 2016) |
| 90 | Mike Bushell | 2019 | Emily (m. 2019) |
| 91 | Michelle Visage | 2019 | David Case |
| 92 | Kelvin Fletcher | 2019 | Eliza Marsland (m. 2015) |


In [53]:
# A function to return one year from the list of marriage and divorce years
# found in a spouse string. Positions 0, 2, 4, 6 are marriages and
# positions 1, 3, 5, 7 are divorces; missing entries come back as 0.

def MarDivList_fromstring(ss,whichone):
    # I'm really looking for either 2-digit or 4-digit numerical sequences
    years = []
    i=0
    while i < len(ss)-1:
        if ss[i:i+2].isnumeric():
            # Only treat it as a 4-digit year if 4 characters are really left
            if i+4 <= len(ss) and ss[i:i+4].isnumeric():
                years.append(int(ss[i:i+4]))
                i=i+4
            else:
                years.append(1900+int(ss[i:i+2]))
                i=i+2
        i=i+1
    while len(years)<8:
        years.append(0)
    return(years[whichone])


def MarDivList_fromindex(index,whichone):
    # Spouse String
    ss = df3.iloc[index,2]
    return(MarDivList_fromstring(ss,whichone))
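For comparison, the same idea can be written with a regular expression. This is an alternative sketch rather than the notebook's implementation: the name `years_from_spouse_string` is my own, and it may behave differently on digit runs that are not exactly 2 or 4 characters long.

```python
import re

def years_from_spouse_string(text: str, slots: int = 8) -> list:
    """Pull 4-digit years (or 2-digit years, read as 19xx) out of a spouse string."""
    years = []
    for match in re.findall(r"\d{4}|\d{2}", text):
        years.append(int(match) if len(match) == 4 else 1900 + int(match))
    # Pad with zeros so every row has the same number of slots
    return years + [0] * max(0, slots - len(years))

print(years_from_spouse_string("Tom Pitfield (m. 2016)"))
```

Either way, the result depends on the infobox listing marriage and divorce years in strict alternation, which is worth spot-checking on a few rows.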

In [33]:
# Add all the marriage & divorce columns

df3['First Marriage']=(df3.index.map(lambda x:MarDivList_fromindex(x,0)))
df3['First Divorce']=(df3.index.map(lambda x:MarDivList_fromindex(x,1)))
df3['Second Marriage']=(df3.index.map(lambda x:MarDivList_fromindex(x,2)))
df3['Second Divorce']=(df3.index.map(lambda x:MarDivList_fromindex(x,3)))
df3['Third Marriage']=(df3.index.map(lambda x:MarDivList_fromindex(x,4)))
df3['Third Divorce']=(df3.index.map(lambda x:MarDivList_fromindex(x,5)))

df3['Fourth Marriage']=(df3.index.map(lambda x:MarDivList_fromindex(x,6)))
df3['Fourth Divorce']=(df3.index.map(lambda x:MarDivList_fromindex(x,7)))

# And then lose the string about the spouses
df3=df3.drop('Spouses String',axis=1)

In [34]:
df3.tail()

| | Name | Competing Year | First Marriage | First Divorce | Second Marriage | Second Divorce | Third Marriage | Third Divorce | Fourth Marriage | Fourth Divorce |
|---|---|---|---|---|---|---|---|---|---|---|
| 88 | Kate Silverton | 2018 | 2010 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 89 | Catherine Tyldesley | 2019 | 2016 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 90 | Mike Bushell | 2019 | 2019 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 91 | Michelle Visage | 2019 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 92 | Kelvin Fletcher | 2019 | 2015 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |


## Section 4: Dropping the people outside of our time ranges
I don't need the information about people who were already divorced, or weren't yet married at the time of their competition - so get rid!

In [35]:
df4=df3.copy()

In [36]:
def MarriedAtTheTime(start,end,comp):
    if ((comp>=start and comp<=end)or(start != 0 and comp>=start and end==0)):
        return(1)
    else: return(0)
    
def WithinXYears(div,comp,x):
    if (div != 0 and div-comp <=x and comp<=div):
        return(1)
    else: return (0)
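Both helpers rely on 0 meaning "no such event recorded". A few spot checks with made-up years show how that plays out (the two functions are restated in compact form so the snippet runs on its own):

```python
def MarriedAtTheTime(start, end, comp):
    # Married during [start, end], or married since start with no divorce recorded
    return int((start <= comp <= end) or (start != 0 and comp >= start and end == 0))

def WithinXYears(div, comp, x):
    # Divorce recorded, on or after the competing year, and at most x years later
    return int(div != 0 and comp <= div and div - comp <= x)

print(MarriedAtTheTime(2010, 0, 2018))     # married in 2010, no divorce recorded
print(MarriedAtTheTime(1990, 2000, 2018))  # divorced long before competing
print(WithinXYears(2019, 2018, 1))         # divorced the year after competing
print(WithinXYears(0, 2018, 1))            # no divorce recorded
```

Because only years are compared, a divorce in the same calendar year as the competition counts as "within 1 year", even if it was finalised before the series aired.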

In [37]:
df4['Married at the Time']= \
    df4.apply(lambda x: MarriedAtTheTime(x['First Marriage'], x['First Divorce'],x['Competing Year']), axis=1) \
    + df4.apply(lambda x: MarriedAtTheTime(x['Second Marriage'], x['Second Divorce'],x['Competing Year']), axis=1)\
    + df4.apply(lambda x: MarriedAtTheTime(x['Third Marriage'], x['Third Divorce'],x['Competing Year']), axis=1)\
    + df4.apply(lambda x: MarriedAtTheTime(x['Fourth Marriage'], x['Fourth Divorce'],x['Competing Year']), axis=1)

In [38]:
# For each window (1, 2 and 3 years), count the divorces that fall within
# that many years after the competing year, across all four marriages.
divorce_cols = ['First Divorce','Second Divorce','Third Divorce','Fourth Divorce']

for x, label in [(1,'Within 1 Year'),(2,'Within 2 Years'),(3,'Within 3 Years')]:
    df4[label] = sum(
        df4.apply(lambda row: WithinXYears(row[col], row['Competing Year'], x), axis=1)
        for col in divorce_cols)

In [39]:
# 57 were married at the time of competing!

testingdf = df4[df4['Married at the Time']==1]
testingdf.describe()

| | Competing Year | First Marriage | First Divorce | Second Marriage | Second Divorce | Third Marriage | Third Divorce | Fourth Marriage | Fourth Divorce | Married at the Time | Within 1 Year | Within 2 Years | Within 3 Years |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 57.0 | 57.0 | 57.0 | 57.0 | 57.0 | 57.0 | 57.0 | 57.0 | 57.0 | 57.0 | 57.0 | 57.0 | 57.0 |
| mean | 2011.175439 | 1997.964912 | 807.473684 | 526.157895 | 70.035088 | 70.263158 | 70.421053 | 35.245614 | 35.263158 | 1.0 | 0.140351 | 0.140351 | 0.140351 |
| std | 4.480604 | 14.402337 | 990.524701 | 888.277522 | 370.531759 | 371.739854 | 372.576116 | 266.098551 | 266.231004 | 0.0 | 0.350438 | 0.350438 | 0.350438 |
| min | 2004.0 | 1959.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 25% | 2007.0 | 1989.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 50% | 2011.0 | 2001.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 75% | 2015.0 | 2008.0 | 2000.0 | 1976.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| max | 2019.0 | 2019.0 | 2017.0 | 2019.0 | 1996.0 | 2008.0 | 2014.0 | 2009.0 | 2010.0 | 1.0 | 1.0 | 1.0 | 1.0 |


In [40]:
df4[df4['Within 1 Year']!=0].shape

(8, 14)

## Section 5: Getting figures

"In England and Wales in 2017, only 8.4 per 1,000 opposite-sex couples got divorced."*
I'm going to use this figure for now because I can't find a better one for more recent years.

*https://www.theguardian.com/lifeandstyle/2018/dec/09/in-it-for-the-long-haul-why-divorce-rates-are-falling-fast

In [41]:
# Divorce rate in England and Wales on average
ew_rate = 8.4/1000
ew_percent=str(round(ew_rate*100,2))+"%"

In [42]:
# Divorce rate among married-at-the-time Strictly contestants

sample_size = df4['Married at the Time'].sum()
sample_divorces = df4['Within 1 Year'].sum()

scd_rate = sample_divorces/sample_size
scd_percent=str(round(scd_rate*100,2))+"%"

In [43]:
print("England and Wales divorce percentage in a year: " + ew_percent)
print("SCD divorce percentage in a year: " + scd_percent)

England and Wales divorce percentage in a year: 0.84%
SCD divorce percentage in a year: 14.04%


In [44]:
print("Number of times higher Strictly is than normal people: " + str(round(scd_rate/ew_rate,2)))

Number of times higher Strictly is than normal people: 16.71


## Statistical Significance Test

I'm going for a one-sample, one-sided binomial test (classic), just to test whether the sample we've picked could plausibly have come from a population with a yearly divorce rate of 0.84%, as the ONS published. To be honest, I suspect I haven't got enough datapoints for the result to be statistically significant.

<em> H0: Strictly contestants have a divorce rate of 0.84% each year
<br> H1: Strictly contestants have a higher divorce rate than 0.84% each year</em>

 PROB(8 or more divorces in 57 happen with a 0.84% rate)
 = 1 - PROB(7 or fewer happen)
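
Written out, with $X$ the number of divorces among the $n = 57$ married contestants and $p = 0.0084$ under H0, the quantity being computed is the upper tail of a binomial distribution. This assumes each contestant's marriage is an independent trial with the same yearly rate, which is a simplification:

$$
P(X \ge 8) \;=\; 1 - \sum_{k=0}^{7} \binom{57}{k}\, p^{k}\,(1-p)^{57-k}, \qquad p = 0.0084
$$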

In [45]:
def nCr(n, r): 
    return (fact(n) / (fact(r)  
                * fact(n - r))) 
  
# Returns factorial of n 
def fact(n): 
    res = 1
    for i in range(2, n+1): 
        res = res * i 
    return res 

In [46]:
# Calculate the probability of 8 or more divorces happening,
#   as 1 - PROB(7 or fewer divorces)

cum_prob = 1
for i in range(0,sample_divorces):
    this_prob = pow(ew_rate,i)*pow(1-ew_rate,sample_size-i)*nCr(sample_size,i)
    # print('prob of '+str(i)+' divorces: '+str(this_prob))
    cum_prob = cum_prob - this_prob
    
print('PROB(8 or more divorces)= '+str(cum_prob*100)+'%')

PROB(8 or more divorces)= 2.83942613033504e-06%
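
As a cross-check, the same tail can be summed directly from 8 upwards instead of subtracting from 1, which avoids losing precision when the answer is tiny. This sketch uses `math.comb` (Python 3.8+); the function name `binom_upper_tail` is my own, and the inputs are the 57 contestants, 8 divorces and 8.4-per-1,000 rate from the cells above.

```python
from math import comb

def binom_upper_tail(k: int, n: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p), summing the upper tail directly."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

tail = binom_upper_tail(8, 57, 8.4 / 1000)
print(f"P(8 or more divorces) = {tail:.3e}")
```

It should land very close to the figure printed by the notebook's own loop, and the same function can be reused with a different baseline rate.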


As a probability that's 2.84e-08, which is waaay lower than even a 1% significance boundary, so it's definitely statistically significant:

<b> Reject H0 - the divorce rate is definitely higher than the England and Wales average!</b>

## Finding a Better Mean

I think it's a bit unfair to compare celebrity divorce rates to normal-people divorce rates, so I really want to get a rate for celebrities who may or may not have been on Strictly.

https://yougov.co.uk/ratings/entertainment/fame/people/all has a list of the most famous people in the UK. I'm going to scrape the marriage statuses of this lot, and see if that gives a different number.

In [83]:
# Set up my list of celebrities, and get their spouses

celeb_list = pd.read_csv('FamousList.csv',header=None)
celeb_list = celeb_list.rename(columns = {0:'Name'})
celeb_list['Spouses String']=(celeb_list['Name'].map(lambda x:GetMarriageStatus_fromname(x)))
celeb_list = celeb_list[celeb_list['Spouses String']!=0]

celeb_list['First Marriage']=(celeb_list['Spouses String'].map(lambda x:MarDivList_fromstring(x,0)))
celeb_list['First Divorce']=(celeb_list['Spouses String'].map(lambda x:MarDivList_fromstring(x,1)))
celeb_list['Second Marriage']=(celeb_list['Spouses String'].map(lambda x:MarDivList_fromstring(x,2)))
celeb_list['Second Divorce']=(celeb_list['Spouses String'].map(lambda x:MarDivList_fromstring(x,3)))
celeb_list['Third Marriage']=(celeb_list['Spouses String'].map(lambda x:MarDivList_fromstring(x,4)))
celeb_list['Third Divorce']=(celeb_list['Spouses String'].map(lambda x:MarDivList_fromstring(x,5)))
celeb_list['Fourth Marriage']=(celeb_list['Spouses String'].map(lambda x:MarDivList_fromstring(x,6)))
celeb_list['Fourth Divorce']=(celeb_list['Spouses String'].map(lambda x:MarDivList_fromstring(x,7)))

In [88]:
def DivorcedThisYear(DivOne,DivTwo,DivThree,DivFour,year):
    if (DivOne == year or DivTwo == year or DivThree == year or DivFour == year): 
        return(1)
    else: return(0)

In [106]:
# To work out how many people were married (and divorced) in each year

for year in range(1980,2019):
    celeb_list['married in '+str(year)]=celeb_list.apply(lambda x: MarriedAtTheTime(x['First Marriage'], x['First Divorce'],year), axis=1)\
    +celeb_list.apply(lambda x: MarriedAtTheTime(x['Second Marriage'], x['Second Divorce'],year), axis=1)\
    +celeb_list.apply(lambda x: MarriedAtTheTime(x['Third Marriage'], x['Third Divorce'],year), axis=1)\
    +celeb_list.apply(lambda x: MarriedAtTheTime(x['Fourth Marriage'], x['Fourth Divorce'],year), axis=1)
    celeb_list['divorced in '+str(year)]=celeb_list.apply(lambda x:DivorcedThisYear(x['First Divorce'],x['Second Divorce'],x['Third Divorce'],x['Fourth Divorce'],year),axis=1)
    
yearlydivrates=[]

for year in range(1980,2019):
    thisyeardivrate = celeb_list['divorced in '+str(year)].sum()/celeb_list['married in '+str(year)].sum()
    yearlydivrates.append(thisyeardivrate)
    
# The divorce rate (in an average year) for celebrities is....
celeb_rate = sum(yearlydivrates)/len(yearlydivrates)
print('The celebrity divorce rate in a year is: '+str(celeb_rate))

The celebrity divorce rate in a year is: 0.04859266735022459


## (Hopefully) Slightly Better Significance Test

<em> H0: Strictly contestants have a divorce rate of 4.86% each year <br>
H1: Strictly contestants have a higher divorce rate than 4.86% each year</em>

In [108]:
# Calculate the probability of 8 or more divorces happening,
#   as 1 - PROB(7 or fewer divorces)

cum_prob = 1
for i in range(0,sample_divorces):
    this_prob = pow(celeb_rate,i)*pow(1-celeb_rate,sample_size-i)*nCr(sample_size,i)
    # print('prob of '+str(i)+' divorces: '+str(this_prob))
    cum_prob = cum_prob - this_prob
    
print('PROB(8 or more divorces)= '+str(cum_prob*100)+'%')

PROB(8 or more divorces)= 0.610466797859054%


Which, at about 0.61%, is STILL under the 1% significance level, so it's significant!!

# CONCLUSION: There IS a true Strictly Curse!