In [1]:
import pandas as pd
import nltk
from collections import Counter

![logo](logo3.png)

# Rightbound - Home assignment 

***

## Task A - Labeling/Classification

For this part we will analyze the titles dataset and identify the inconsistencies between the functions and function_group columns.

Specifically, we want to understand:
1. Why do they appear?
2. How can we fix them?
3. How could such problems affect the business?

In [2]:
titles = pd.read_csv('titles_HA.csv')
# Sort the tags inside each 'functions' value so that the same combination
# always has the same representation, regardless of tag order.
titles['functions_sort'] = titles['functions'].apply(lambda x: ','.join(sorted(x.split(','))))
# Lowercase all titles.
titles['title'] = titles['title'].str.lower()
# Count the function tags assigned to each title.
titles['functions_ct'] = titles['functions'].apply(lambda x: x.count(',') + 1)
# Look at how the tag counts are distributed.
tags_assigned = titles['functions_ct'].value_counts().reset_index()
tags_assigned.columns = ['tags', 'count']
display(tags_assigned)
# Share of titles with exactly one tag, computed directly instead of relying on row order.
print('Titles with only 1 tag assigned: {:.2%}'.format((titles['functions_ct'] == 1).mean()))

Unnamed: 0,tags,count
0,1,3010
1,2,989
2,3,221
3,4,22
4,5,3
5,6,1


Titles with only 1 tag assigned: 70.89%


### First anomaly
The same functions belong to different function_groups. Specifically, the boundaries between the Creative and Marketing function_groups, and between the Marketing and Operations function_groups, are not clear.

In [3]:
group = titles.groupby(['function_group', 'functions_sort'])['title'].count().reset_index()
group[group[['functions_sort']].duplicated(keep=False)].sort_values(by=['functions_sort'])

Unnamed: 0,function_group,functions_sort,title
12,Creative,"Brand Marketing,Creative",2
200,Marketing,"Brand Marketing,Creative",1
15,Creative,"Content,Creative",6
217,Marketing,"Content,Creative",3
16,Creative,Creative,7
226,Marketing,Creative,26
267,Marketing,"Marketing,Operations",4
289,Operations,"Marketing,Operations",3


Take the 'Content,Creative' functions as an example: they are assigned, with no apparent pattern, to either the Creative or the Marketing function_group.

In [4]:
(titles[(titles['functions_sort']=='Content,Creative')].drop_duplicates().sort_values(by=['functions','title']))

Unnamed: 0,title,functions,function_group,functions_sort,functions_ct
2131,art director / graphic designer / content supe...,"Content,Creative",Marketing,"Content,Creative",2
1453,"associate creative director, content & copy","Content,Creative",Marketing,"Content,Creative",2
2002,creative and content manager,"Content,Creative",Creative,"Content,Creative",2
2070,creative content project manager,"Content,Creative",Creative,"Content,Creative",2
910,"project management, creative & content team","Content,Creative",Creative,"Content,Creative",2
913,retired director content & creative management,"Content,Creative",Creative,"Content,Creative",2
3020,senior art director - content development,"Content,Creative",Marketing,"Content,Creative",2
3066,senior creative content manager - latam,"Content,Creative",Creative,"Content,Creative",2
3711,senior director of creative services and content,"Content,Creative",Creative,"Content,Creative",2


It **seems like** the 'functions' column is obtained by assigning tags to keywords found in the title column. It is less clear how the 'function_group' column is assigned.

We find no reason to classify the 'associate creative director, content & copy' as 'marketing' while classifying the 'senior director of creative services and content' as 'creative'.

**This can be fixed** by deriving the function_group column from the functions column. Specifically, all the 'brand marketing,creative' titles should be classified as 'marketing'.

More generally, we can create a dictionary that maps each functions combination to a single function_group.

If we don't fix this error, any code that filters on the function_group column could miss relevant titles when looking for people in the marketing department.
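The dictionary idea can be sketched in plain Python as follows. This is only an illustration: the name `GROUP_BY_FUNCTIONS` and the helper functions are hypothetical, and apart from the 'Brand Marketing,Creative' entry (stated above) the target groups are assumptions that would need a business decision.

```python
# Hypothetical mapping: sorted functions combination -> one function_group.
# Only the first entry comes from the text; the others are assumptions.
GROUP_BY_FUNCTIONS = {
    'Brand Marketing,Creative': 'Marketing',
    'Content,Creative': 'Creative',        # assumption
    'Marketing,Operations': 'Marketing',   # assumption
}

def normalize_functions(functions):
    """Sort the comma-separated tags so that tag order does not matter."""
    return ','.join(sorted(tag.strip() for tag in functions.split(',')))

def assign_group(functions, current_group):
    """Return the mapped group; keep the current group for unmapped combinations."""
    return GROUP_BY_FUNCTIONS.get(normalize_functions(functions), current_group)

# (functions, current function_group) pairs used only for illustration.
rows = [
    ('Creative,Brand Marketing', 'Creative'),
    ('Content,Creative', 'Marketing'),
    ('Sales', 'Sales'),
]
fixed = [assign_group(functions, group) for functions, group in rows]
print(fixed)
```

On the real dataset the same dictionary could be applied to the `functions_sort` column, falling back to the existing `function_group` for combinations that are not in the mapping.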

Let's take a closer look at the Marketing,Operations functions:

In [5]:
with pd.option_context('display.max_colwidth', 250):   
    display(titles[(titles['functions_sort']=='Marketing,Operations')].drop_duplicates().sort_values(by=['functions','title']))

Unnamed: 0,title,functions,function_group,functions_sort,functions_ct
621,cmo/coo borgess ambulatory care,"Marketing,Operations",Operations,"Marketing,Operations",2
2963,executive assistant to president/cmo and coo,"Marketing,Operations",Operations,"Marketing,Operations",2
617,"manager, national operations & trade marketing","Marketing,Operations",Marketing,"Marketing,Operations",2
881,"marketing & operations coordinator, patagonia provisions","Marketing,Operations",Marketing,"Marketing,Operations",2
3038,sr manager- operations & director of marketing,"Marketing,Operations",Marketing,"Marketing,Operations",2
1996,vice president of marketing/merchandise operations,"Marketing,Operations",Marketing,"Marketing,Operations",2


Here we see that marketing and operations often appear together. As we will see later, titles with the Operations function are very often not assigned to the Operations function_group.

We discuss this in more detail below.

One more observation: executive assistants (row 2963) are important contacts when trying to reach higher-ranking people, so it could be helpful to identify those titles.
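One possible way to flag such titles is a simple keyword pattern. The sketch below is an assumption about how this could be done, not part of the original analysis; the pattern and the name `is_assistant_title` are hypothetical and would need tuning against the full dataset.

```python
import re

# Hypothetical pattern: 'executive assistant' or 'assistant to ...'.
ASSISTANT_PATTERN = re.compile(r'\bexecutive assistant\b|\bassistant to\b')

def is_assistant_title(title):
    """Return True when the title looks like an assistant to a senior role."""
    return bool(ASSISTANT_PATTERN.search(title.lower()))

# Titles taken from the tables shown in this notebook.
sample_titles = [
    'executive assistant to president/cmo and coo',
    'cmo/coo borgess ambulatory care',
    'assistant production & procurement manager, marketing & creative services',
]
flags = [is_assistant_title(title) for title in sample_titles]
print(flags)
```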

### Second anomaly 

**Other** is the second-largest function_group category, behind only Marketing.

In [6]:
counts = titles['function_group'].value_counts().reset_index()
counts.columns = ['function_group', 'counts']
display(counts.head(10))
# A boolean mean gives the share directly and avoids calling int() on a Series.
other_share = (titles['function_group'] == 'Other').mean()
print('Titles identified as Other on function_group: {:.2%}'.format(other_share))

Unnamed: 0,function_group,counts
0,Marketing,1422
1,Other,572
2,IT,370
3,Digital,333
4,Sales,282
5,eCommerce,207
6,HR,161
7,Security,157
8,Customer,136
9,Dev,114


Titles identified as Other on function_group: 13.47%


Many words that appear in the title column are not being considered.

We can count those words and assign a function tag to the most frequent ones.

Otherwise we would be missing many relevant titles.

In [7]:
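# Note: WordNetLemmatizer needs the WordNet corpus (nltk.download('wordnet')).
w_tokenizer = nltk.tokenize.WhitespaceTokenizer()
lemmatizer = nltk.stem.WordNetLemmatizer()

def lemmatize_text(text):
    # Split on whitespace only, so punctuation stays attached (e.g. 'director,').
    return [lemmatizer.lemmatize(w) for w in w_tokenizer.tokenize(text)]

titles['lemma'] = titles['title'].apply(lemmatize_text)

# Count the lemmas of the titles whose functions value is 'Other'.
freq = Counter(word for sentence in titles.loc[titles['functions']=='Other','lemma'] for word in sentence)
for word, frequency in freq.most_common(40):
    print(word, frequency)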

manager 110
director 77
social 63
of 61
service 55
support 35
director, 33
& 30
and 30
lead 28
senior 28
vice 27
manager, 25
communication 21
design 21
executive 20
information 19
- 19
health 18
chief 18
assistant 17
customer 17
business 16
management 16
work 16
corporate 15
president 15
responsibility 14
president, 13
application 10
development 10
officer 10
display 9
sr. 9
data 8
system 8
architect 8
medical 8
specialist 8
case 8


It seems that we are leaving many high-rank titles without a description. Words like manager, director, lead, senior, vice, chief, corporate and president are among the most frequent when we filter for titles whose functions value is Other.
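The counts above split some words across variants ('director' and 'director,') and include connectors such as 'of', '&' and '-'. A possible refinement, sketched here under assumptions, is to strip punctuation and drop a few stop words before counting. The stop-word list and the sample titles are made up for illustration and are not taken from the dataset.

```python
import re
from collections import Counter

# Small illustrative stop-word list (an assumption, not a complete list).
STOPWORDS = {'of', 'and', 'the', 'to', 'for', 'in'}

def clean_tokens(title):
    """Lowercase, keep only alphanumeric runs, and drop stop words."""
    tokens = re.findall(r'[a-z0-9]+', title.lower())
    return [token for token in tokens if token not in STOPWORDS]

# Hypothetical titles, used only to show the effect of the cleaning step.
sample_titles = [
    'director, social responsibility',
    'vice president of corporate communication',
    'senior manager, customer support',
    'director of health information management',
]
freq_clean = Counter(tok for title in sample_titles for tok in clean_tokens(title))
print(freq_clean.most_common(3))
```

With this cleaning, 'director,' and 'director' are counted together. The trade-off is that hyphenated or dotted tokens are split (for example 'sr.' becomes 'sr'), so the pattern may need adjusting.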


### Third anomaly
When there is more than one function, the function_group is most often determined by the last element of the functions column.

Let's visualize titles with four functions to exemplify this:

In [8]:
with pd.option_context('display.max_colwidth', 250):    
    display(titles[titles['functions_ct']==4].sample(5, random_state=8))

Unnamed: 0,title,functions,function_group,functions_sort,functions_ct,lemma
1054,"assistant production & procurement manager, marketing & creative services","Procurement,Product,Marketing,Creative",Creative,"Creative,Marketing,Procurement,Product",4,"[assistant, production, &, procurement, manager,, marketing, &, creative, service]"
1263,event marketing manager,"Marketing,Event Marketing,Event Manager,Event Organiser",Marketing,"Event Manager,Event Marketing,Event Organiser,Marketing",4,"[event, marketing, manager]"
702,senior director of digital sales and operations,"Digital,Digital,Sales,Operations",Sales,"Digital,Digital,Operations,Sales",4,"[senior, director, of, digital, sale, and, operation]"
2413,"account manager, digital content","Content,Media,Digital,Account Manager",Sales,"Account Manager,Content,Digital,Media",4,"[account, manager,, digital, content]"
3014,"digital & ecommerce brand manager tazo, lipton, pukka & pure leaf","Brand Marketing,Digital,Brand Marketing,eCommerce",eCommerce,"Brand Marketing,Brand Marketing,Digital,eCommerce",4,"[digital, &, ecommerce, brand, manager, tazo,, lipton,, pukka, &, pure, leaf]"


Row 1054: Even though Marketing appears among the functions, the row is assigned to Creative.

Row 702: The 'senior director of digital sales and operations' is assigned to the Sales function_group and not to Digital, apparently because Sales appears after Digital.

What about the Operations function, which appears after Sales? Recall that Operations functions are often not assigned to the Operations function_group; we think that is the case here.

Let's see how general this rule is.



In [9]:
# Take the last tag of the comma-separated 'functions' string
# (the whole string when there is only one tag).
titles['last'] = titles['functions'].str.rsplit(',', n=1).str[-1]

grouped_last = titles.groupby(['function_group', 'last']).agg({'title': 'count', 'functions_ct':'mean'}).reset_index()
grouped_last.rename(columns={'title':'count', 'functions_ct':'functions_mean'}, inplace=True)
with pd.option_context('display.max_rows', None, 'display.max_columns', None):
    display(grouped_last)

Unnamed: 0,function_group,last,count,functions_mean
0,Benefits,General Benefits,1,1.0
1,CEO,CEO,14,1.928571
2,CEO,Operations,1,3.0
3,Creative,Creative,46,2.065217
4,Creative,Operations,4,2.25
5,Creative,Producer,1,2.0
6,Customer,Client Service,9,1.222222
7,Customer,Customer Care,12,1.416667
8,Customer,Customer Experience,42,1.428571
9,Customer,Customer Operations,4,1.0


As we suspected, the function_group is generally determined by the last function, with some exceptions (such as Operations).

Let's see which other functions are not dominant even when they appear at the end of the string.

In [10]:
print('Last functions assigned to multiple function_groups')
grouped_last[grouped_last['last'].duplicated(keep=False)].sort_values(by='last')

Last functions assigned to multiple function_groups


Unnamed: 0,function_group,last,count,functions_mean
70,Marketing,BI,1,2.0
131,eCommerce,BI,1,2.0
26,Digital,BI,1,3.0
93,Other,BI,8,1.0
52,IT,BI,3,2.0
66,Legal,Content,1,2.0
95,Other,Content,99,1.0
53,IT,Content,1,2.0
3,Creative,Creative,46,2.065217
75,Marketing,Creative,29,1.103448


A few conclusions from the table above:
1. The functions BI, Content, Creative, Engineering, Operations and Producer are not dominant when they appear at the end.
2. When BI, Content and Claims appear alone, they are assigned to the Other function_group.
3. We couldn't decipher the assignment rule when Creative appears at the end. As suggested in anomaly #1, the division between Marketing and Creative is messy.
4. The Operations function is not dominant when it appears at the end, but when it appears alone it is assigned to the Operations function_group.
5. Even when Engineering appears at the end, if it is preceded by R&D the title goes to the Technology function_group.

The assignment algorithm can be improved by ordering the functions so that the function_group is always assigned by the same rule, for example according to the first or the last element of the list.

Also, the function_groups can be revisited so that we keep only the ones that group many observations.

The function_groups Benefits, Design, Real Estate, and Support contain one title each.
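A consistent rule of this kind could look like the sketch below, which always derives the function_group from the last function through a single lookup table. The table `FUNCTION_TO_GROUP` and the function `group_from_last` are hypothetical; the real table would have to cover every function tag, and choosing the last element rather than the first is only one of the options mentioned above.

```python
# Hypothetical lookup: function tag -> function_group (illustrative entries only).
FUNCTION_TO_GROUP = {
    'Sales': 'Sales',
    'Digital': 'Digital',
    'Operations': 'Operations',
    'Customer Experience': 'Customer',
}

def group_from_last(functions, default='Other'):
    """Assign the group from the last comma-separated tag, always by the same rule."""
    last = functions.rsplit(',', 1)[-1].strip()
    return FUNCTION_TO_GROUP.get(last, default)

examples = ['Digital,Digital,Sales,Operations', 'Customer Experience', 'BI']
assigned = [group_from_last(functions) for functions in examples]
print(assigned)
```

Under this rule the 'senior director of digital sales and operations' would go to Operations instead of Sales, so the ordering of the tags becomes the only thing that needs to be decided.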


## Task B – Definition of B2C companies
B2C companies differ from B2Bs mainly in that transactions take place between one big player and many small players. This characteristic suggests several criteria that can help us identify such companies:
1. Because a B2C's goal is to attract as many customers as possible, B2Cs usually invest more in marketing. While B2Bs reach their customers through personalized attention, B2Cs usually need exposure. One advantage of this criterion is that it makes it simple to separate industries according to our needs. Many firms track companies' advertising expenditures; Kantar is one of them. The availability of this information made me choose the marketing criterion for building the list of 100 B2Cs. One caveat: when using marketing lists, removing food and pharmaceutical brands is crucial for keeping only the B2Cs, because these brands usually reach the final consumer through other businesses.
2. A related feature is their need for exposure. B2Cs are usually located at street level on high-traffic avenues. We can obtain this data from Google Maps.
   Note that this is useful only for brick-and-mortar B2Cs, whereas the previous criterion also lets us find online businesses.
3. Given the uneven relationship between vendors and customers in B2C operations, most countries have a government agency that protects consumers' rights. From the claims records, we can obtain the names of companies that are not giving good service to their customers.
   In Israel, this data can be seen at consumers.org.il.
4. With the boom in online shopping, many forums where consumers share their reviews have emerged. Marketplaces like Amazon are a go-to place when looking for B2Cs at a detailed level.
5. Another criterion that differentiates B2Cs from B2Bs is the average ticket size. When we have access to financial statements, a smaller invoice size and a larger number of separate sales are both indicators that we are looking at a B2C.
6. In some countries, VAT is only added when a final transaction has been made. Looking at sales that generate a consumer tax is another way to identify B2C companies.

Most business transactions take place between businesses, and the road to a final product is often long, but here we have listed some parameters that can help us identify the companies whose operations are directed at the final consumer.
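As an illustration of the ticket-size criterion (item 5), the sketch below computes the average invoice amount, the number of invoices and the number of distinct customers per seller. The invoice records are invented for the example and the function name `ticket_profile` is hypothetical; no threshold is proposed, since any cut-off would depend on the industry and the data available.

```python
from collections import defaultdict
from statistics import mean

# Made-up invoices: (seller, customer_id, amount). Not real data.
invoices = [
    ('retailer', 'c1', 20.0),
    ('retailer', 'c2', 35.0),
    ('retailer', 'c3', 15.0),
    ('retailer', 'c1', 30.0),
    ('supplier', 'b1', 12000.0),
    ('supplier', 'b1', 8000.0),
]

def ticket_profile(records):
    """Summarize average ticket, invoice count and distinct customers per seller."""
    amounts = defaultdict(list)
    customers = defaultdict(set)
    for seller, customer_id, amount in records:
        amounts[seller].append(amount)
        customers[seller].add(customer_id)
    return {
        seller: {
            'avg_ticket': mean(values),
            'n_invoices': len(values),
            'n_customers': len(customers[seller]),
        }
        for seller, values in amounts.items()
    }

profile = ticket_profile(invoices)
print(profile)
```

A seller with a low average ticket spread over many distinct customers looks more like a B2C, while a high average ticket concentrated in a few customers looks more like a B2B.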

Please refer to the Excel file named 100B2C. for the list containing the one hundred companies that we identified as B2Cs. 