Import some modules, and set up this notebook to read files correctly.

In [72]:
import os
os.chdir('../..')
from pipelines.util import *
import pandas as pd

Read the data file into a pandas dataframe.

In [73]:
data = pd.read_excel(os.path.join(WDIR, 'true-north/true_north_may_2024.xlsx'))
data2 = pd.read_excel(os.path.join(WDIR, 'true-north/true_north_may_2024.xlsx'))

Add a "last_updated" column which uses the 'last activity date' column if it exists, otherwise the create date.

In [74]:
# Currently have both create date (when the person filled the questionnaire for the first time) and the "last_updated"
# date, which is the "Last Activity Date" iff it exists, otherwise it is the "Create Date"
data['last_updated'] = data['Last Activity Date'].combine_first(data['Create Date'])
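
As a small illustration of how `combine_first` behaves here, the sketch below applies the same call to a toy frame. The dates are invented for illustration and are not taken from the True North data; only the two column names match the notebook.

```python
import pandas as pd

# Toy data, invented for illustration only
example = pd.DataFrame({
    'Create Date': pd.to_datetime(['2024-01-01', '2024-02-01']),
    'Last Activity Date': pd.to_datetime(['2024-03-15', None]),
})

# Take 'Last Activity Date' where present, otherwise fall back to 'Create Date'
example['last_updated'] = example['Last Activity Date'].combine_first(example['Create Date'])
print(example['last_updated'].tolist())
```

The first row keeps its activity date, while the second row, which has no activity date, falls back to its create date.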

List all the columns we don't need, then drop them from the dataframe.

In [75]:
cols = [
    'Marketing contact status',
    'Last Activity Date',
    'What sector do you operate in?',
    'How did you hear about True North? [Other]',
    'How did you hear about True North? ',
    'Are there any other challenge areas you feel were not identified or reflected in the report? ',
    'Can you share any companies or people in your network who you think would be interested in hearing about or being a part of future True North activity? ',
    'Do you have any further recommendations of how Brabners can activate the True North network, or areas of focus you would recommend? ',
    'Any other comments ',
    'Added To List On',
    'Country/Region',
    'Create Date.1',
    'Last Activity Date.1'
    ]
data.drop(columns=cols, inplace=True)

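In the following 4 code blocks this logic applies:
- Make a list of columns that are the same question and need merging.
- Merge them together with priority according to the order of the list.
- If an entry exists, use that value. Otherwise, go to the next column.
- Add this as a new, single column with a new title.
- Remove the old columns from the dataframe.

The sector block is the one exception to the list-order priority: it uses the free-text "[Other]" answer only when the main answer is 'Other - please specify', and otherwise prefers the second column in the list over the third.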

In [76]:
cols = ['What sector(s) does your organisation operate in? [Other]', 'What sector(s) does your organisation operate in??', 'What sector(s) does your organisation operate in?']
# cols are F,G,H
# data['sector'] = data[cols[0]].combine_first(data[cols[1]]).combine_first(data[cols[2]])

def combine_sectors(row):
    if row[cols[2]] == 'Other - please specify':
        return row[cols[0]]
    elif pd.notnull(row[cols[1]]):
        return row[cols[1]]
    elif pd.notnull(row[cols[2]]):
        return row[cols[2]]
    else:
        return None

data['sector'] = data.apply(combine_sectors, axis=1)
data.drop(columns=cols, inplace=True)

In [77]:
cols = ['Where is your organisation based? Select all that apply', 'Where is your organisation based? Select all that apply [Other]']
data['location'] = data[cols[0]].combine_first(data[cols[1]])
data.drop(columns=cols, inplace=True)

In [78]:
cols = ['Which theme of the True North report do you most identify with and could support activity around? (select all that apply)', 
        'Which theme of the True North report do you most identify with and could support activity around? ']
data['Which theme of the True North report do you most identify with and could support activity around?'] = data[cols[0]].combine_first(data[cols[1]])
data.drop(columns=cols, inplace=True)

In [79]:
cols = ['How would you like to be involved with the True North network (select all that apply)? [Other]',
        'How would you like to be involved with the True North network (select all that apply)?']
data['How would you like to be involved with the True North network?'] = data[cols[0]].combine_first(data[cols[1]])
data.drop(columns=cols, inplace=True)
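
The three `combine_first` blocks above repeat the same pattern, so it could be factored into a helper. The sketch below is one possible way to do that; `merge_columns` is a hypothetical name that does not exist in `pipelines.util`, and the toy frame is invented for illustration.

```python
import pandas as pd

def merge_columns(df, cols, new_name):
    """Hypothetical helper: merge `cols` into `new_name`, preferring earlier columns."""
    merged = df[cols[0]]
    for col in cols[1:]:
        # Fill any remaining blanks from the next column in the list
        merged = merged.combine_first(df[col])
    # Drop the old columns and add the merged one under its new title
    out = df.drop(columns=cols)
    out[new_name] = merged
    return out

# Toy data, invented for illustration only
toy = pd.DataFrame({
    'Q (v1)': ['A', None, None],
    'Q (v2)': ['B', 'C', None],
    'other': [1, 2, 3],
})
toy = merge_columns(toy, ['Q (v1)', 'Q (v2)'], 'Q')
print(toy)
```

Unlike the cells above, this version returns a new dataframe rather than modifying the original in place, which makes it easier to re-run a cell without hitting a missing-column error.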

Map the company sizes onto the standard ranges. This assumes every answer appears in the mapping; `replace` leaves any value that is not listed unchanged.

One company had both 1-10 and 11-50 as a company range. We've opted for the 10-49 range based on probability.

In [80]:
mapper = {'51 - 100': '50-99', '501 - 1,000': '500-999', '11 - 50': '10-49', '1 - 10': '0-9', '101 - 250': '100-249',
 '251 - 500': '250-499', '1 - 10; 11 - 50': '10-49'}
data['company_size'] = data['How many people work in your organisation?'].replace(mapper)
data.drop(columns='How many people work in your organisation?', inplace=True)
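
Because `replace` passes unlisted values through unchanged, one way to test the "no edge cases" assumption is to look for values that are not among the target ranges after mapping. This is a sketch on invented answers; `size_mapper` is a shortened, hypothetical stand-in for the full `mapper` above, and '1,001+' is a made-up answer chosen because it is missing from the mapping.

```python
import pandas as pd

# Toy answers, invented for illustration; '1,001+' is deliberately absent from the mapping
sizes = pd.Series(['1 - 10', '11 - 50', '1,001+'])
size_mapper = {'1 - 10': '0-9', '11 - 50': '10-49'}

mapped = sizes.replace(size_mapper)

# Non-blank values that are not one of the target ranges slipped through unmapped
unmapped = mapped[mapped.notna() & ~mapped.isin(list(size_mapper.values()))]
print(unmapped.tolist())
```

An empty result would support the assumption; anything listed needs a new entry in the mapping.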

Sort the values by last_updated date, with the most recent at the top (`ascending=False`), then use last_updated as the index.

In [81]:
data.sort_values(by='last_updated', inplace=True, ascending=False)
data.set_index('last_updated', inplace=True)
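
As a quick sanity check of the resulting order, this sketch sorts a toy frame with the same arguments. The dates and company names are invented for illustration. With `ascending=False`, the most recent date ends up in the first row.

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    'last_updated': pd.to_datetime(['2024-01-10', '2024-05-02', '2023-12-05']),
    'Company name': ['A', 'B', 'C'],
})

# Same arguments as the cell above, but without inplace so the toy frame is untouched
ordered = toy.sort_values(by='last_updated', ascending=False).set_index('last_updated')
print(ordered['Company name'].tolist())
```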

We found that some entries are likely to be from the same company. Currently these are:
- TUBR Ltd; gettubr.com
- Sitehop; sitehop.com

We will manually correct these, using the row with the actual company name (not the URL) as the priority and filling its blanks with any entries (if present) from the other row.

In [82]:
def merge_duplicate_companies(data, names):
    '''
        Names is a list of 2 company names to merge. 
        First company name in the list is used as priority when merging.
        Function assumes only 2 duplicate entries. Can be adjusted in future if we need to account for n duplicates.
    '''
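    # work on a copy with a numeric index, so the caller's dataframe is not modified
    data = data.reset_index()

    # get the index labels of the named rows (assumes each name matches exactly one row)
    index_a = data.index[data['Company name'] == names[0]][0]
    index_b = data.index[data['Company name'] == names[1]][0]

    # combine the rows as Series, so blanks in row a are filled from row b
    # (combining single-row dataframes aligns on the row index and fills nothing)
    combined_row = data.loc[index_a].combine_first(data.loc[index_b])

    # write the combined row back into the dataframe at index_a
    data.loc[index_a] = combined_row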

    # drop index b as no longer needed, reset the index and drop the old one, set the index to "last_updated".
    data = data.drop(index_b).reset_index(drop=True).set_index('last_updated')

    print('Successfully merged rows and created the following row: \n', data.loc[data['Company name'] == names[0]].to_csv())

    return data
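
The docstring notes that the function could be adjusted for n duplicates. One possible approach, sketched here on invented rows, is to map every alias onto the name to keep and then take the first non-null value per column within each group. `merge_aliases` and the `aliases` table are hypothetical names, not part of this notebook, and the sketch assumes every row has a company name (by default `groupby` drops rows whose key is missing).

```python
import pandas as pd

# Toy rows, invented for illustration; 'acme.com' is an alias of 'Acme Ltd'
toy = pd.DataFrame({
    'Company name': ['Acme Ltd', 'acme.com', 'Other Co'],
    'City': [None, 'Leeds', 'York'],
    'sector': ['Retail', 'Hospitality', None],
})

# Hypothetical alias table: maps each duplicate name onto the name to keep
aliases = {'acme.com': 'Acme Ltd'}

def merge_aliases(df, aliases):
    """Collapse alias rows into the canonical row, filling blanks from the aliases."""
    df = df.copy()
    canonical = df['Company name'].replace(aliases)
    # Flag alias rows and sort them after canonical rows so canonical values take priority
    df['_is_alias'] = df['Company name'] != canonical
    df['Company name'] = canonical
    df = df.sort_values('_is_alias', kind='stable')
    # GroupBy.first() returns the first non-null value per column within each group
    merged = df.groupby('Company name', sort=False).first()
    return merged.drop(columns='_is_alias').reset_index()

result = merge_aliases(toy, aliases)
print(result)
```

In the toy data, the 'Acme Ltd' row keeps its own sector ('Retail') and picks up the city ('Leeds') that only the alias row had.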

Using the above function, we merge each pair of duplicates we spotted by calling the function once per pair.

In [83]:
data = merge_duplicate_companies(data, ['TUBR Ltd', 'gettubr.com'])
data = merge_duplicate_companies(data, ['Sitehop', 'sitehop.com'])

Successfully merged rows and created the following row: 
 last_updated,Create Date,Do you feel the True North report identified the key challenges and opportunities facing the region? ,Are you interested in attending future True North events? ,Are you currently a B Corp or in the process of becoming a B Corp?,Would you be interested in hearing more from Brabners about the B Corp process?,Company name,City,Industry,sector,location,Which theme of the True North report do you most identify with and could support activity around?,How would you like to be involved with the True North network?,company_size
2023-12-05 10:34:00,2023-12-05 10:34:00,No,Yes,No,No,TUBR Ltd,London,Computer Software,Hospitality,,Sustainable development,,

Successfully merged rows and created the following row: 
 last_updated,Create Date,Do you feel the True North report identified the key challenges and opportunities facing the region? ,Are you interested in attending future True North events? ,Are you currently a B

Write the data to a csv called 'true_north_may_2024_clean.csv'.

In [84]:
data.to_csv(os.path.join(WDIR, 'true-north/true_north_may_2024_clean.csv'))