# Charity Register Data Scrape
## Louis Othen
The purpose of this script is to scrape all charities listed on the charity register maintained by the government's Charity Commission. The register can be found here:
https://www.gov.uk/find-charity-information

The reason for the scrape is to collect information on each charity (what it does and what contact information it has) and format it into a table. This table will be joined with another one created previously, before sending a mass email to the charities to offer my volunteering services in data.

# Import the relevant libraries

In [77]:
import pandas as pd

from selenium                       import webdriver
from selenium.webdriver.common.by   import By
from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup as bs
from tqdm import tqdm 
import requests as req
import time as tim
from lxml import etree
#from ydata_profiling import ProfileReport  #Stopped working since python 3.12
import re
from plotly import express as px
from datetime import datetime as dti

## Initial Scrape

### Set up with selenium to get past cookies

In [37]:
# Open up test browser window
#-----------------------------------------------------
ded = webdriver.Chrome()

# Navigate to charity register screen 
#-----------------------------------------------------
ded.get('https://register-of-charities.charitycommission.gov.uk/charity-search/-/results/page/1/delta/20/sorted-by/charity-income/desc')

# Accept cookies
#-----------------------------------------------------
cookies_btn = ded.find_element(By.ID,'_com_placecube_cookieconsent_web_portlet_CookieConsentPortlet_acceptAllCookies')
cookies_btn.click()

In [38]:
# Get list of page numbers
#-----------------------------------------------------
end_page_len = ded.find_elements(By.ID, "_uk_gov_ccew_portlet_CharitySearchPortlet_pageIterator")

# Get value of last page number 
#-----------------------------------------------------
pe = end_page_len[0].text.split(sep='\n')
end_page_num = int(pe[-2])
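
The `pe[-2]` lookup relies on the pager text ending with the last page number followed by exactly one more line. A slightly more defensive sketch, assuming only that page numbers appear as digit-only lines in the pager text, is to take the largest numeric line. The helper name `last_page_number` and the sample text are made up for illustration.

```python
def last_page_number(pager_text: str) -> int:
    """Return the largest page number found in the pager's visible text."""
    numbers = []
    for line in pager_text.splitlines():
        token = line.strip().replace(",", "")  # tolerate "1,234" style numbers
        if token.isdigit():
            numbers.append(int(token))
    if not numbers:
        raise ValueError("no page numbers found in pager text")
    return max(numbers)


# Hypothetical pager text, not copied from the real site
sample_pager = "Previous\n1\n2\n3\n...\n612\nNext"
print(last_page_number(sample_pager))
```

With Selenium this could be called as `last_page_number(end_page_len[0].text)`.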

### Using requests to parse the contents of each page

In [None]:
header = {
            "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.75 Safari/537.36",
            "X-Requested-With": "XMLHttpRequest"
        }

# Empty list ready to store looped data 
#-----------------------------------------------------
collected_data  = []

# For loop over webpages, to download high level charity data
#-----------------------------------------------------
for i in tqdm(range(1, end_page_num + 1)):  # +1 so the last page is included
    url_use = 'https://register-of-charities.charitycommission.gov.uk/charity-search/-/results/page/' + str(i) + '/delta/20/sorted-by/charity-income/desc'
    print(url_use)
    r= req.get(url_use, headers = header)
    collected_data.append(pd.read_html(r.text)[0])

# Save collected list as a pandas DataFrame
#-----------------------------------------------------
result = pd.concat(collected_data)

# Remove unneeded columns
#-----------------------------------------------------
result.drop(columns=['level_0', 'index'], inplace=True, errors='ignore')  # ignore if these columns are absent
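
Because `range` excludes its upper bound, a loop over results pages has to run to `last_page + 1` to reach the final page. The sketch below builds the full list of results URLs up front, using the same URL pattern as the loop above; the names `BASE_URL` and `results_page_urls` are illustrative, not part of the original notebook.

```python
BASE_URL = (
    "https://register-of-charities.charitycommission.gov.uk/charity-search/"
    "-/results/page/{page}/delta/20/sorted-by/charity-income/desc"
)


def results_page_urls(last_page: int) -> list[str]:
    """Return one results URL per page, from page 1 to last_page inclusive."""
    return [BASE_URL.format(page=page) for page in range(1, last_page + 1)]


urls = results_page_urls(3)
print(len(urls))   # number of pages requested
print(urls[-1])    # URL of the final page
```

Building the list first also makes it easy to resume a partial scrape by slicing the list.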
    

In [None]:
# See results of scrape before saving down into CSV file.
#-----------------------------------------------------
#result = pd.concat(collected_data)
display(result)
result.to_csv(r'D:\Coding\Charity_Scrape\charityregister.csv')  # raw string avoids invalid escape warnings

## Now to get the child data from each individual link  

### Set up dataframe with columns to mine from webpages

In [None]:
# Read in previously farmed dataframe of summary pages
#-----------------------------------------------------
result = pd.read_csv(r'D:\Coding\Charity_Scrape\charityregister.csv')

# Remove records with Nulls - not useful at this stage - may be worth looking at later
#-----------------------------------------------------
result = result.dropna()

result.drop(columns = ['Unnamed: 0'], inplace = True) # weird index as per previous download -  remove

# Set up columns ready for further mining
#-----------------------------------------------------
result['Charity number'] = result['Charity number'].astype(str).str.split('.').str[0]
result['overview'] = ''
result['what_it_does'] = ''
result['who_it_benefits'] = ''
result['how_its_done'] = ''
result['where_they_operate'] = ''
result['address'] = ''
result['phone'] = ''
result['email'] = ''
result['website'] = ''

# Frame to work on (currently the full frame; swap in result.head(n) to test on a sample)
testo = result


In [None]:
# function to automate 1) get link and content , and prep for xpath extraction
#-----------------------------------------------------
def get_page(url):
    header =    {
                    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.75 Safari/537.36",
                    "X-Requested-With": "XMLHttpRequest"
                }
    try:
        # page = req.session()
        # page.cookies.update({'__hs_opt_out': 'no'})

        # Fetch the URL passed in (previously this read the global url_use)
        charity_page = req.get(url=url, headers=header)

        soup = bs(charity_page.text,'lxml')
        dom = etree.HTML(str(soup))

        return dom

    except req.exceptions.RequestException as e:  # requests is imported as req
        print(f"Error fetching page: {e}")
        return None

In [None]:
# Loop through each charity details and store it 

for index, row in tqdm(testo.iterrows(), total = testo.shape[0]):
     url_use = 'https://register-of-charities.charitycommission.gov.uk/charity-search/-/charity-details/' + row['Charity number']

     xpage = get_page(url_use)
     try:
          testo.loc[index, 'overview']            = xpage.xpath('//*[@id="_uk_gov_ccew_onereg_charitydetails_web_portlet_CharityDetailsPortlet_mainContent"]/div[1]/div/p')[0].text
     except (IndexError, AttributeError):  # AttributeError covers get_page returning None
          testo.loc[index, 'overview']            = "N/A"

     url_use = url_use + '/what-who-how-where'
     wpage = get_page(url_use)

     try:
          testo.loc[index, 'what_it_does']        = wpage.xpath('//*[@id="_uk_gov_ccew_onereg_charitydetails_web_portlet_CharityDetailsPortlet_mainContent"]/div[2]/div[2]/ul/li')[0].text
          testo.loc[index, 'who_it_benefits']     = wpage.xpath('//*[@id="_uk_gov_ccew_onereg_charitydetails_web_portlet_CharityDetailsPortlet_mainContent"]/div[3]/div[2]/ul/li')[0].text
          testo.at[index, 'how_its_done']         = wpage.xpath('//*[@id="_uk_gov_ccew_onereg_charitydetails_web_portlet_CharityDetailsPortlet_mainContent"]/div[4]/div[2]/ul/li/text()')
          testo.at[index, 'where_they_operate']   = wpage.xpath('//*[@id="_uk_gov_ccew_onereg_charitydetails_web_portlet_CharityDetailsPortlet_mainContent"]/div[5]/div[2]/ul/li/text()')
     except (IndexError, AttributeError):  # AttributeError covers get_page returning None
          testo.loc[index, 'what_it_does']        = "N/A"
          testo.loc[index, 'who_it_benefits']     = "N/A"
          testo.loc[index, 'how_its_done']        = "N/A"
          testo.loc[index, 'where_they_operate']  = "N/A"
     url_use = url_use.replace('/what-who-how-where', '')

     url_use = url_use + '/contact-information'
     cpage = get_page(url_use)


     try:
          testo.at[index, 'address']              = cpage.xpath('//*[@id="_uk_gov_ccew_onereg_charitydetails_web_portlet_CharityDetailsPortlet_mainContent"]/dl/div[1]/dd/text()')
          testo.loc[index, 'phone']               = cpage.xpath('//*[@id="_uk_gov_ccew_onereg_charitydetails_web_portlet_CharityDetailsPortlet_mainContent"]/dl/div[2]/dd/text()')
                                                                 
          testo.loc[index, 'email']               = cpage.xpath('//*[@id="charity-contact-email-link"]/text()')
          testo.loc[index, 'website']             = cpage.xpath('//*[@id="charity-contact-website-link"]/text()')
     except Exception:
          testo.loc[index, 'address']             = "N/A"
          testo.loc[index, 'phone']               = "N/A"
          testo.loc[index, 'email']               = "N/A"
          testo.loc[index, 'website']             = "N/A"
     
     tim.sleep(0.05)
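
The repeated `try`/`except` blocks all answer the same question: did the XPath query match anything? A small helper can make that explicit. This is only a sketch: `first_or_default` is a hypothetical name, and it assumes each query returns a list whose items are either elements with a `.text` attribute or plain strings (as `text()` queries do in lxml).

```python
def first_or_default(matches, default="N/A"):
    """Return the first match as stripped text, or default when nothing usable matched."""
    if not matches:
        return default
    first = matches[0]
    # Elements expose .text; results of text() queries are already strings
    text = getattr(first, "text", first)
    if isinstance(text, str) and text.strip():
        return text.strip()
    return default


class FakeElement:
    """Stand-in for an lxml element, used only for this demonstration."""
    text = "  Supports charities with grants  "


print(first_or_default([FakeElement()]))
print(first_or_default([]))
print(first_or_default(["  info@example.org "]))
```

In the loop this would read, for example, `first_or_default(cpage.xpath('//*[@id="charity-contact-email-link"]/text()'))`, provided `cpage` is not `None`.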

     

     

Save this Data!

In [None]:
testo.to_csv(r'D:\Coding\Charity_Scrape\charityregister_detail_raw.csv')

## Data Cleaning
Now that all the data we need has been mined, the next phase is to clean it up before performing exploratory data analysis.

Bring the data back in:

In [69]:
raw_data = pd.read_csv(r"D:\Coding\Charity_Scrape\charityregister_detail_raw.csv")

clean = raw_data
clean.dropna(
                            subset = [
                                        'overview'
                                        ,'what_it_does'
                                        ,'who_it_benefits'
                                        ,'how_its_done'
                                        ,'where_they_operate'
                                        ,'address'
                                        ,'phone'
                                        ,'email'
                                        ,'website'
                                     ]
                           ,inplace= True
                        )
clean.reset_index(inplace = True)
clean.drop(columns=['Unnamed: 0', 'Reporting', 'Status','how_its_done','where_they_operate','Charity number','index'], inplace=True)



In [78]:
clean['overview'] = clean['overview'].str.replace('\n', '').str.replace('\t', '').str.replace("'",'')

# Removed for now as messy from mining, but could be used later
#--------------------------------------------------------------------------------------
#clean['how_its_done'] = clean['how_its_done'].str.replace('[', '').str.replace("'",'')
#clean['how_its_done'] = clean['how_its_done'].str.replace(']', '').str.replace("'",'')    
#clean['where_they_operate'] = clean['where_they_operate'].str.replace('[', '').str.replace(']', '').str.replace("'",'')

clean['website'] = clean['website'].str.replace('[','').str.replace(']', '').str.replace("'",'')
clean['email'] = clean['email'].str.replace('[','').str.replace(']', '').str.replace("'",'')
clean['phone'] = clean['phone'].apply(lambda x: re.findall(r'\d+', str(x))).apply(lambda x: ''.join(map(str,x)))
clean['address'] = clean['address'].apply(lambda x: re.findall(r'\w+\d*\s+',str(x))).apply(lambda x: [item.replace('t ', '')  if item == 't ' else item for item in x ]).apply(lambda x: ' '.join(x)).apply(lambda x: re.sub(r'\s+', ' ',x).strip())

clean['Income'] = clean['Income'].str.replace('£','').str.replace(',','')

clean = clean.applymap(lambda x: x.lower())
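
Several scraped columns hold Python lists that the CSV round trip turned into strings such as `"['www.cafonline.org']"`. Stripping brackets and quotes works, but it also deletes apostrophes that belong to the value. An alternative sketch, assuming the cells really are stringified Python lists, parses them back with `ast.literal_eval`. The helper name `parse_listish` is made up for this example.

```python
import ast


def parse_listish(cell, joiner=" "):
    """Turn "['a', 'b']" back into "a b"; return other values unchanged as strings."""
    text = str(cell).strip()
    if text.startswith("[") and text.endswith("]"):
        try:
            items = ast.literal_eval(text)
        except (ValueError, SyntaxError):
            return text  # not a valid list literal, leave it alone
        return joiner.join(str(item).strip() for item in items)
    return text


print(parse_listish("['www.cafonline.org']"))
print(parse_listish('["st aubyn\'s school", "woodford green"]', joiner=", "))
print(parse_listish("N/A"))
```

It could be applied column by column, for example `clean['email'] = clean['email'].apply(parse_listish)`.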

In [79]:
clean.head()

Unnamed: 0,Charity name,Income,overview,what_it_does,who_it_benefits,address,phone,email,website
0,the charities aid foundation,1044275000,charities aid foundation (caf) is a registered...,general charitable purposes,other charities or voluntary bodies,charities aid foundation 25 kings hill avenue ...,3000123088,companysecretary@cafonline.org,www.cafonline.org
1,nuffield health,995600000,"to advance, promote and maintain health and he...",children/young people,provides services,nuffield health epsom gateway 2 ashley avenue ...,7824482211,iben.thomson@nuffieldhealth.com,www.nuffieldhealth.com
2,the arts council of england,941845043,arts council england works to get great art to...,arts/culture/heritage/science,other charities or voluntary bodies,arts council england the hive 49 lever street ...,1619344317,enquiries@artscouncil.org.uk,www.artscouncil.org.uk
3,the british council,896655974,the british council creates friendly knowledge...,the general public/mankind,makes grants to individuals,1 redman place stratford london united kingdom,1619577755,ceo.office@britishcouncil.org,www.britishcouncil.org
4,the national trust for places of historic inte...,643329000,to look after places of historic interest or n...,the general public/mankind,provides buildings/facilities/open space,national trust heelis kemble drive swindon,1793817400,thesecretary@nationaltrust.org.uk,www.nationaltrust.org.uk


In [42]:
# Currently not working since updating python to 3.12
#ProfileReport(clean)

# Let's save it for later use


In [80]:
clean.to_csv(r"D:\Coding\Charity_Scrape\charityregister_detail_clean.csv")

# Read it back in to analyse which charities to approach

In [8]:
data = pd.read_csv(r"D:\Coding\Charity_Scrape\charityregister_detail_clean.csv")

In [9]:
data.head()
data.drop(columns=['Unnamed: 0'], inplace=True)

Bring in the previous charity data and join the two tables together.


In [11]:
prev_data = pd.read_csv(r"D:\Coding\Charity_Scrape\charity_email_list.csv")
final_data = pd.concat([data, prev_data])
final_data.columns

Index(['Charity name', 'Income', 'overview', 'what_it_does', 'who_it_benefits',
       'address', 'phone', 'email', 'website'],
      dtype='object')
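
`pd.concat` stacks the two tables row-wise, keeps each table's original index, and fills columns missing from one table with `NaN`. The sketch below shows one way to get a clean, de-duplicated result, assuming `email` is a reasonable key for spotting the same charity in both lists. The function name and the sample rows are invented for illustration.

```python
import pandas as pd


def merge_contact_lists(*frames):
    """Stack contact tables, normalise emails, and drop repeated emails."""
    merged = pd.concat(frames, ignore_index=True)  # fresh 0..n-1 index
    merged["email"] = merged["email"].str.strip().str.lower()
    return merged.drop_duplicates(subset="email", keep="first").reset_index(drop=True)


# Invented sample rows, not taken from the register
scraped = pd.DataFrame({
    "Charity name": ["alpha trust", "beta fund"],
    "email": ["info@alpha.example", "hello@beta.example"],
    "phone": ["111", "222"],
})
manual = pd.DataFrame({
    "Charity name": ["Beta Fund", "gamma aid"],
    "email": ["Hello@beta.example ", "contact@gamma.example"],
})

combined = merge_contact_lists(scraped, manual)
print(combined)
```

Rows from the manually built list have no `phone` column here, so those cells come out as `NaN` rather than raising an error.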

In [13]:
final_data

Unnamed: 0,Charity name,Income,overview,what_it_does,email,website
0,the charities aid foundation,1044275000,charities aid foundation (caf) is a registered...,general charitable purposes,companysecretary@cafonline.org,www.cafonline.org
1,nuffield health,995600000,"to advance, promote and maintain health and he...",children/young people,iben.thomson@nuffieldhealth.com,www.nuffieldhealth.com
2,the arts council of england,941845043,arts council england works to get great art to...,arts/culture/heritage/science,enquiries@artscouncil.org.uk,www.artscouncil.org.uk
3,the british council,896655974,the british council creates friendly knowledge...,the general public/mankind,ceo.office@britishcouncil.org,www.britishcouncil.org
4,the national trust for places of historic inte...,643329000,to look after places of historic interest or n...,the general public/mankind,thesecretary@nationaltrust.org.uk,www.nationaltrust.org.uk
...,...,...,...,...,...,...
13,Macmillan,0,cancer support,cancer support,contact@macmillan.org.uk,https://www.macmillan.org.uk/
14,Revitalise,0,repsirte for carer and disabled,repsirte for carer and disabled,info@revitalise.org.uk,https://revitalise.org.uk
15,The Earl of Southampton Trust,0,local housing for the elderly,local housing for the elderly,info@eost.org.uk,https://eost.org.uk
16,Dogs Trust,0,Dog Rehoming,Dog Rehoming,info@dogstrust.org.uk,https://www.dogstrust.org.uk


In [16]:
final_data.to_csv(r"D:\Coding\Charity_Scrape\charity_email_merged.csv")

# Analyse what is there and remove unwanted

In [17]:
df = pd.read_csv(r"D:\Coding\Charity_Scrape\charity_email_merged.csv")
df.drop(columns=['Unnamed: 0'], inplace=True)
df.columns
df.shape

(12144, 6)

Remove further charities within the chosen category, such as religious or partisan ones. Essentially, avoid anything that could be construed as 'controversial'.


In [25]:
words_to_remove = ['islam', 'islamic','diocese','synagogue','ARTHRITIS','HORSE','CHURCH','GOD','Religion','Catholic','MOTORISTS','Christian'
                   ,'SPIRITUAL','COAL','Gospel','THEATRE','Missionary','VEGAN','CINEMA','evange','football','DANCE','choir'
                   ,'estate','trains','Holocaust','ROMAN','SPORTS','GALLERY','opera','CLUB','POLISH','MANOR','LEISURE','worship'
                   ,'chapel','cats','BUDDHIST','League','Sisters','museum','sheep','breeders','BOTANICAL','NAUTICAL', 'BUILDINGS'
                   ,'SIKH','chinese','ART','AGRICULTURAL','HALL','DIOCESAN','PRODUCTIONS','COLLECTION','PALACE'
                   ,'ORCHESTRA','WHALE', 'DOLPHIN','Christian','BIBLE','CHRIST','theological','Jesus','festival']

cols_to_search = ['Charity name']

# Filter the merged frame loaded above as df
row_rm = df[cols_to_search].apply(lambda x: x.str.contains('|'.join(words_to_remove), case=False, na=False)).any(axis=1)

c_data = df[~row_rm]

c_data.shape

(9231, 6)
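
Joining the words with `|` matches them anywhere inside a name, so a short entry such as `'ART'` also matches "heart" and `'HALL'` matches "marshall". If whole-word matching is closer to the intent, one possible sketch wraps the alternatives in `\b` word boundaries. This is a suggestion rather than the notebook's method; `build_name_filter` and the sample names are hypothetical.

```python
import re


def build_name_filter(words):
    """Compile a case-insensitive pattern that matches any of the words as whole words."""
    alternatives = "|".join(re.escape(word) for word in words)
    return re.compile(r"\b(?:" + alternatives + r")\b", flags=re.IGNORECASE)


name_filter = build_name_filter(["ART", "HALL", "cats"])

# Invented names used only to show the difference
names = [
    "british heart foundation",
    "the art fund",
    "marshall trust",
    "village hall committee",
]
kept = [name for name in names if not name_filter.search(name)]
print(kept)
```

The compiled pattern's `.pattern` string can be passed to `Series.str.contains(..., case=False, na=False)` in place of the plain `'|'.join(...)`. Note that stems such as `'evange'` are meant to match inside longer words, so they would need to stay in a separate, substring-based list.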

In [54]:
# Save final filtered data ready to choose charities to approach
#-----------------------------------------------------
c_data.to_csv(r"D:\Coding\Automation\charity_email_final.csv")

# Now to look at charities with a particular condition to focus on.


In [78]:
c_data = pd.read_csv("D:/Coding/git_projects/Charity_Scrape/charity_email_final.csv")
display(c_data)

Unnamed: 0,Charity name,Income,overview,what_it_does,email,website
0,marlborough college,55465000,marlborough college is incorporated by royal c...,children/young people,sslamb@marlboroughcollege.org,www.marlboroughcollege.org
1,aga khan foundation (united kingdom),57059000,the akf(uk) is an affiliate of the internation...,children/young people,front.office@akdn.org,https://www.akf.org.uk
2,millfield,57879000,the education of children and young people wit...,children/young people,summerhayes.r@millfieldschool.com,www.millfieldschool.com
3,st aubyn's (woodford green) school trust,59238401,"the school continues to provide a first-rate, ...",children/young people,bursar@staubyns.com,www.staubyns.com
4,actionaid,59635375,actionaid uk is part of an international feder...,education/training,mail@actionaid.org,www.actionaid.org.uk
5,coif charities investment fund,60607000,investment fund,other defined groups,jackie.fox@ccla.co.uk,www.ccla.co.uk
6,alternative futures group limited,61354000,the charity operates in the north west of engl...,elderly/old people,ask@afgroup.org.uk,www.afgroup.org.uk
7,the wellington college,62581000,the charity aims to provide a world-class educ...,children/young people,sjxc@wellingtoncollege.org.uk,www.wellingtoncollege.org.uk
8,the brandon trust,63004376,"brandon trust exists to improve lifestyles, op...",people with disabilities,info@brandontrust.org,www.brandontrust.org
9,the donkey sanctuary,63368000,to provide rescue homes and treatment for donk...,children/young people,karen.terrey@thedonkeysanctuary.org.uk,www.thedonkeysanctuary.org.uk


In [79]:
c_types = set(c_data['what_it_does'])
c_types

{'arts/culture/heritage/science',
 'children/young people',
 'education/training',
 'elderly/old people',
 'general charitable purposes',
 'other charities or voluntary bodies',
 'other defined groups',
 'people with disabilities',
 'the prevention or relief of poverty'}

In [80]:
to_send = c_data.head(n = 100)

to_send

Unnamed: 0,Charity name,Income,overview,what_it_does,email,website
0,marlborough college,55465000,marlborough college is incorporated by royal c...,children/young people,sslamb@marlboroughcollege.org,www.marlboroughcollege.org
1,aga khan foundation (united kingdom),57059000,the akf(uk) is an affiliate of the internation...,children/young people,front.office@akdn.org,https://www.akf.org.uk
2,millfield,57879000,the education of children and young people wit...,children/young people,summerhayes.r@millfieldschool.com,www.millfieldschool.com
3,st aubyn's (woodford green) school trust,59238401,"the school continues to provide a first-rate, ...",children/young people,bursar@staubyns.com,www.staubyns.com
4,actionaid,59635375,actionaid uk is part of an international feder...,education/training,mail@actionaid.org,www.actionaid.org.uk
5,coif charities investment fund,60607000,investment fund,other defined groups,jackie.fox@ccla.co.uk,www.ccla.co.uk
6,alternative futures group limited,61354000,the charity operates in the north west of engl...,elderly/old people,ask@afgroup.org.uk,www.afgroup.org.uk
7,the wellington college,62581000,the charity aims to provide a world-class educ...,children/young people,sjxc@wellingtoncollege.org.uk,www.wellingtoncollege.org.uk
8,the brandon trust,63004376,"brandon trust exists to improve lifestyles, op...",people with disabilities,info@brandontrust.org,www.brandontrust.org
9,the donkey sanctuary,63368000,to provide rescue homes and treatment for donk...,children/young people,karen.terrey@thedonkeysanctuary.org.uk,www.thedonkeysanctuary.org.uk


# Compile the final list to email, dated so I can track who I emailed

In [81]:
charity_filepath = 'D:\\Coding\\git_projects\\Charity_Scrape\\' + 'charity_email_sent_' + dti.now().strftime("%Y-%m-%d") + '.csv'
print(charity_filepath)
to_send.to_csv(charity_filepath)

D:\Coding\git_projects\Charity_Scrape\charity_email_sent_2024-02-08.csv
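
The same dated filename can be built with `pathlib`, which avoids hand-escaped backslashes and works on any operating system. A minimal sketch, assuming the date stamp should stay in `YYYY-MM-DD` form; `dated_csv_path` and the `output` folder are illustrative names.

```python
from datetime import date
from pathlib import Path


def dated_csv_path(folder, stem, on=None):
    """Return folder/stem_YYYY-MM-DD.csv, using today's date unless one is given."""
    on = on or date.today()
    return Path(folder) / f"{stem}_{on.isoformat()}.csv"


# Fixed date so the result is reproducible
path = dated_csv_path("output", "charity_email_sent", date(2024, 2, 8))
print(path.name)
```

The returned `Path` can be passed straight to `to_send.to_csv(path)`.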


In [82]:
# c_data
to_send

Unnamed: 0,Charity name,Income,overview,what_it_does,email,website
0,marlborough college,55465000,marlborough college is incorporated by royal c...,children/young people,sslamb@marlboroughcollege.org,www.marlboroughcollege.org
1,aga khan foundation (united kingdom),57059000,the akf(uk) is an affiliate of the internation...,children/young people,front.office@akdn.org,https://www.akf.org.uk
2,millfield,57879000,the education of children and young people wit...,children/young people,summerhayes.r@millfieldschool.com,www.millfieldschool.com
3,st aubyn's (woodford green) school trust,59238401,"the school continues to provide a first-rate, ...",children/young people,bursar@staubyns.com,www.staubyns.com
4,actionaid,59635375,actionaid uk is part of an international feder...,education/training,mail@actionaid.org,www.actionaid.org.uk
5,coif charities investment fund,60607000,investment fund,other defined groups,jackie.fox@ccla.co.uk,www.ccla.co.uk
6,alternative futures group limited,61354000,the charity operates in the north west of engl...,elderly/old people,ask@afgroup.org.uk,www.afgroup.org.uk
7,the wellington college,62581000,the charity aims to provide a world-class educ...,children/young people,sjxc@wellingtoncollege.org.uk,www.wellingtoncollege.org.uk
8,the brandon trust,63004376,"brandon trust exists to improve lifestyles, op...",people with disabilities,info@brandontrust.org,www.brandontrust.org
9,the donkey sanctuary,63368000,to provide rescue homes and treatment for donk...,children/young people,karen.terrey@thedonkeysanctuary.org.uk,www.thedonkeysanctuary.org.uk


If these charities are still in the original register, remove them.

In [75]:

c_data.drop(labels=to_send.index, axis=0, inplace=True)

c_data.to_csv('D:/Coding/git_projects/Charity_Scrape/charity_email_final.csv')

c_data

Unnamed: 0,Charity name,Income,overview,what_it_does,email,website
100,marlborough college,55465000,marlborough college is incorporated by royal c...,children/young people,sslamb@marlboroughcollege.org,www.marlboroughcollege.org
101,aga khan foundation (united kingdom),57059000,the akf(uk) is an affiliate of the internation...,children/young people,front.office@akdn.org,https://www.akf.org.uk
102,millfield,57879000,the education of children and young people wit...,children/young people,summerhayes.r@millfieldschool.com,www.millfieldschool.com
103,st aubyn's (woodford green) school trust,59238401,"the school continues to provide a first-rate, ...",children/young people,bursar@staubyns.com,www.staubyns.com
104,actionaid,59635375,actionaid uk is part of an international feder...,education/training,mail@actionaid.org,www.actionaid.org.uk
105,coif charities investment fund,60607000,investment fund,other defined groups,jackie.fox@ccla.co.uk,www.ccla.co.uk
106,alternative futures group limited,61354000,the charity operates in the north west of engl...,elderly/old people,ask@afgroup.org.uk,www.afgroup.org.uk
107,the wellington college,62581000,the charity aims to provide a world-class educ...,children/young people,sjxc@wellingtoncollege.org.uk,www.wellingtoncollege.org.uk
108,the brandon trust,63004376,"brandon trust exists to improve lifestyles, op...",people with disabilities,info@brandontrust.org,www.brandontrust.org
109,the donkey sanctuary,63368000,to provide rescue homes and treatment for donk...,children/young people,karen.terrey@thedonkeysanctuary.org.uk,www.thedonkeysanctuary.org.uk
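
Dropping by index works while `to_send` and `c_data` share the same index in one session, but index labels are not stable once the files have been saved and reloaded. A sketch of an alternative, assuming `email` identifies a charity well enough, removes rows by matching on that column instead. The function name and sample rows are invented.

```python
import pandas as pd


def remove_already_contacted(register, sent, key="email"):
    """Return the rows of register whose key value does not appear in sent."""
    already = set(sent[key].str.strip().str.lower())
    mask = register[key].str.strip().str.lower().isin(already)
    return register[~mask].reset_index(drop=True)


# Invented sample rows, not taken from the register
register = pd.DataFrame({
    "Charity name": ["alpha trust", "beta fund", "gamma aid"],
    "email": ["info@alpha.example", "Hello@beta.example", "contact@gamma.example"],
})
sent = pd.DataFrame({"email": ["hello@beta.example"]})

remaining = remove_already_contacted(register, sent)
print(remaining)
```

This also makes it possible to filter against every dated `charity_email_sent_*.csv` file from earlier runs, not just the most recent batch.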
