<img width="8%" alt="Google Sheets.png" src="https://raw.githubusercontent.com/jupyter-naas/awesome-notebooks/master/.github/assets/logos/Google%20Sheets.png" style="border-radius: 15%">

# Google Sheets - Update people database

**Tags:** #googlesheets #gsheet #data #naas_drivers #growth #leads #openai #linkedin #people

**Author:** [Florent Ravenel](https://www.linkedin.com/in/florent-ravenel/)

**Description:** This notebook updates your people database with new people who interacted with your content and enriches it with ICP and company information.

## Input

### Import libraries

In [None]:
import naas_data_product
import naas
from naas_drivers import gsheet, linkedin
import pandas as pd
import os
from datetime import date
import openai
import time
import re

### Setup variables
**Inputs**
- `entity_dir`: This variable represents the entity directory.
- `input_dir`: Input directory to retrieve file from.
- `file_interactions`: Name of the file to be retrieved.
- `api_key`: LLM API Key.
- `li_at`: Cookie used to authenticate Members and API clients.
- `JSESSIONID`: Cookie used for Cross Site Request Forgery (CSRF) protection and URL signature validation.
- `spreadsheet_url`: Google Sheets spreadsheet URL.
- `sheet_people`: Google Sheets sheet name storing leads profiles.
- `prompt_seniority`: Prompt to be used to find people seniority.
- `prompt_department`: Prompt to be used to find people department.
- `prompt_organization`: Prompt to be used to find people organization.

**Outputs**
- `output_dir`: Output directory to save file to.
- `file_people`: Output file name to save as pickle.
- `datalake_dir`: Datalake directory (outputs folder from abi project).

In [None]:
# Inputs
entity_dir = pload(os.path.join(naas_data_product.OUTPUTS_PATH, "entities", "0"), "entity_dir")
input_dir = os.path.join(entity_dir, "growth-engine", date.today().isoformat())
file_interactions = "linkedin_interactions"
api_key = os.environ.get("NAAS_API_TOKEN") or naas.secret.get('NAAS_API_TOKEN')
li_at = os.environ.get("LINKEDIN_LI_AT") or naas.secret.get("LINKEDIN_LI_AT")
JSESSIONID = os.environ.get("LINKEDIN_JSESSIONID") or naas.secret.get("LINKEDIN_JSESSIONID")
spreadsheet_url = pload(os.path.join(naas_data_product.OUTPUTS_PATH, "entities", "0"), "abi_spreadsheet")
sheet_people = "PEOPLE"
prompt_seniority = """
Find the seniority that matches the most with the occupation extracted from a LinkedIn profile, if there is no exact match, you WON'T return a sentence or ask for more information but strictly return "Professional/Staff".
Seniority definition:
- "Entry-Level": Any occupation with Intern/Internship, Trainee, Junior
- "Professional/Staff": [Role] Specialist, [Role] Analyst, [Role] Coordinator.
- "Senior Professional/Staff": Senior [Role] Specialist, Senior [Role] Analyst.
- "Lead/Supervisor": Team Lead, Supervisor.
- "Manager": Manager, [Department] Manager.
- "Senior Manager": Senior Manager, Director.
- "Executive": Vice President, Chief [Role] Officer (CFO, CTO, etc.).
- "Top Executive": President, CEO, Managing Director.
"""
prompt_department = """
Find the department that matches the most with the occupation extracted from a LinkedIn Profile, if there is no exact match, you WON'T return a sentence or ask for more information but strictly return "Not Found".
Use the list below as a starting point to understand the department affiliations.
Remember, roles can vary across industries, and some individuals may wear multiple hats in smaller companies.
Departments: 
- "Human Resources (HR)": handling employee relations, benefits, recruitment, training, and workplace culture. Keywords: "Recruiter," "HR Manager," or "Training Coordinator."
- "Finance": handling company finances, budgets, payroll, taxes, financial reporting, and investment strategies. Keywords: "Financial Analyst," "Accountant," or "Chief Financial Officer (CFO)" 
- "Marketing": handling promotion of the business, market research, developing marketing strategies, and managing advertising. Look for titles like "Marketing Director," "Brand Manager," or "Content Strategist."
- "Sales": handling revenue generation through sales strategies, customer relationships, and managing sales teams. Titles might include "Sales Representative," "Account Executive," or "Sales Manager."
- "Operations": handling day-to-day operations, production efficiency, quality control, and supply chain management. Look for "Operations Manager," "Production Supervisor," or "Supply Chain Analyst."
- "Information Technology (IT)": managing technology infrastructure, supporting systems, ensuring cybersecurity. Titles such as "IT Support Specialist," "Systems Administrator," or "Chief Information Officer (CIO)" are indicative.
- "Research and Development (R&D)": involving the development of new products or services and market research. Look for "Research Scientist," "Product Developer," or "R&D Manager."
- "Customer Service": assisting customers, maintaining satisfaction, and managing feedback. Common titles include "Customer Service Representative," "Support Technician," or "Client Relations Manager."
- "Legal": Occupations handling legal matters, compliance, contracts, and advising on legal risks. Titles like "Corporate Lawyer," "Legal Assistant," or "Compliance Officer" are relevant.
- "Procurement": look for roles related to acquiring goods and services, negotiating with suppliers. Titles might include "Procurement Officer," "Purchasing Manager," or "Supply Coordinator."
- "Quality Assurance (QA)": ensuring products and services meet standards and regulations. Look for "Quality Assurance Specialist," "QA Tester," or "Quality Manager."
- "Logistics and Supply Chain": managing the flow of goods, optimizing delivery routes, and inventory. Titles such as "Logistics Coordinator," "Supply Chain Analyst," or "Fleet Manager" are common.
- "Public Relations (PR)":  managing the company's public image, press releases, and media communications. Look for "Public Relations Specialist," "Communications Director," or "Media Coordinator."
- "Executive Management": high-level decision-makers setting company strategy. Titles include "Chief Executive Officer (CEO)," "Managing Director," or "Executive Vice President."
- "Product Management": overseeing product development and lifecycle. Titles might be "Product Manager," "Product Owner," or "Product Development Lead."
- "Strategy and Business Development": Look for roles identifying new business opportunities and planning. Titles include "Business Development Manager," "Strategic Planner," or "Growth Analyst."
- "Education": Look for roles that try to teach or educate people.
"""
prompt_organization = """
I will give you the occupation from a profile I get from LinkedIn, you will return the company you can extract from by checking the word after 'at' or '@'.
If you don't find it return "NA"
Don't put the results into quotes.
"""

# Outputs
output_dir = os.path.join(entity_dir, "growth-engine", date.today().isoformat())
file_people = "people"
datalake_dir = naas_data_product.OUTPUTS_PATH
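
The rule that `prompt_organization` describes (take the text after 'at' or '@', otherwise return "NA") can be sketched without an LLM. The helper below is hypothetical and not part of this notebook: `extract_organization` and its regular expression are assumptions made for illustration, and the notebook itself delegates this step to the LLM.

```python
import re


def extract_organization(occupation: str) -> str:
    """Hypothetical rule-based sketch: return the text after ' at ' or '@', else 'NA'."""
    # Capture everything after "at " or "@" up to the next "|", "," or line break
    match = re.search(r"(?:\bat\s+|@\s*)([^|,\n]+)", occupation, flags=re.IGNORECASE)
    return match.group(1).strip() if match else "NA"


print(extract_organization("Data Engineer at Naas.ai"))
print(extract_organization("CTO @ Acme | Speaker"))
print(extract_organization("Freelance writer"))
```

A heuristic like this will misread occupations such as "Great at sales", which is one reason an LLM prompt may be preferred here.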

## Model

### Get people

In [None]:
df_init = gsheet.connect(spreadsheet_url).get(sheet_name=sheet_people)
if not isinstance(df_init, pd.DataFrame):
    df_init = pd.DataFrame()
print("- People (init):", len(df_init))
# df_init.head(3)

### Get interactions

In [None]:
df_interactions = pload(input_dir, file_interactions)    
print('- Interactions:', len(df_interactions))
# df_interactions.head(3)

### Extract profiles from interactions

In [None]:
def get_date_interaction(
    df_init,
    col_date,
    new_col_date,
):
    # Init
    df = df_init.copy()
    df = df.sort_values(col_date, ascending=True)
    
    # Drop duplicates
    to_keep = [
        "PROFILE_URL",
        col_date,
    ]
    df = df[to_keep].drop_duplicates(["PROFILE_URL"], keep="first").rename(columns={col_date: new_col_date})
    return df.reset_index(drop=True)

def get_sql_date(
    df_init,
    col_date,
    new_col_date,
):
    # Init
    df = df_init.copy()
    df = df.sort_values(["PROFILE_URL", col_date], ascending=[True, True])
    df[col_date] = pd.to_datetime(df[col_date].str[:19])
    
    # Drop duplicates
    to_keep = [
        "PROFILE_URL",
        col_date,
    ]
    df = df.groupby(to_keep, as_index=False).agg({"INTERACTION_SCORE": 'sum'})
    # Running total of the interaction score per profile, in chronological order
    df["INTERACTION_CUM"] = df.groupby("PROFILE_URL")["INTERACTION_SCORE"].cumsum()
    df = df[df["INTERACTION_CUM"] >= 3].drop_duplicates(["PROFILE_URL"], keep="first")
    df = df.drop(["INTERACTION_SCORE", "INTERACTION_CUM"], axis=1).rename(columns={col_date: new_col_date})
    return df.reset_index(drop=True)

def get_metadata_by_profile(
    df_init,
    people
):
    # Init
    df = df_init.copy()
    notes = {}
    entities = {}
    
    # Cleaning
    to_select = [
        "ENTITY",
        "PROFILE_URL",
        "FULLNAME",
        "CONTENT_TITLE",
        "CONTENT_URL",
        "INTERACTION",
        "INTERACTION_CONTENT",
        "DATE_INTERACTION"
    ]
    df = df[to_select].sort_values(by="PROFILE_URL").reset_index(drop=True)
    df["INTERACTION_TEXT"] = ""
    df.loc[df["INTERACTION"] == "POST_REACTION", "INTERACTION_TEXT"] = df["FULLNAME"]  + " sent '" + df["INTERACTION_CONTENT"].str.lower() + "' reaction to '" + df["CONTENT_TITLE"].str.strip() + "' (" + df["CONTENT_URL"] + ") on " + df["DATE_INTERACTION"].astype(str)
    df.loc[df["INTERACTION"] == "POST_COMMENT", "INTERACTION_TEXT"] = df["FULLNAME"]  + " commented '" + df["INTERACTION_CONTENT"].str.capitalize() + "' on '" + df["CONTENT_TITLE"].str.strip() + "' (" + df["CONTENT_URL"] + ") on " + df["DATE_INTERACTION"].astype(str)

    # Loop on people (LinkedIn URL)
    for p in people:
        tmp_df = df.copy()
        tmp_df = tmp_df[tmp_df["PROFILE_URL"] == p].reset_index(drop=True)
        interactions = []
        owners = ", ".join(tmp_df["ENTITY"].unique().tolist())
        for row in tmp_df.itertuples():
            # Append interaction text to create notes
            interaction_text = row.INTERACTION_TEXT
            interactions.append(interaction_text)
            
        # Add list to dict
        notes[p] = interactions
        entities[p] = owners
    return notes, entities

def create_db_people(
    df_init,
    df_interactions,
    output_dir,
):
    # Init - Filter on profile
    df_people = df_interactions[df_interactions["PROFILE_URL"].str.contains("https://www.linkedin.com/in/.+")]
    df_direct = df_people.copy()
    df_score = df_people.copy()
    
    # Get first interaction -> Created date AND MQL date
    df_created = get_date_interaction(
        df_people,
        "DATE_INTERACTION",
        "CREATED_DATE",
    )
    df_created["MQL_DATE"] = df_created["CREATED_DATE"]
    
    # Get profile information with last content interaction -> Last interaction date
    to_keep = [
        "PROFILE_URL",
        "FIRSTNAME",
        "LASTNAME",
        "OCCUPATION",
        "PUBLIC_ID",
        "DATE_INTERACTION",
    ]
    df_direct = df_direct[to_keep].drop_duplicates(["PROFILE_URL"])
   
    # Get interactions score by profile
    df_score = df_score.groupby(["PROFILE_URL"], as_index=False).agg({"INTERACTION_SCORE": "sum"})
    
    # Merge dfs
    df = pd.merge(df_created, df_direct, how="left").fillna("NA")
    df = pd.merge(df, df_score, how="left")
    
    # Get the first date on which the cumulative interaction score reaches 3 -> SQL date
    df_sql = get_sql_date(
        df_people,
        "DATE_INTERACTION",
        "SQL_DATE",
    )
    
    # Merge dfs and cleaning: Rename columns
    to_rename = {
        "DATE_INTERACTION": "LAST_INTERACTION_DATE",
    }
    df = pd.merge(df, df_sql, how="left").rename(columns=to_rename).fillna("NA")

    # Cleaning: Remove emojis from name and occupation
    df["FIRSTNAME"] = df.apply(lambda row: remove_emojis(str(row["FIRSTNAME"])), axis=1)
    df["LASTNAME"] = df.apply(lambda row: remove_emojis(str(row["LASTNAME"])), axis=1)
    df["OCCUPATION"] = df.apply(lambda row: remove_emojis(str(row["OCCUPATION"])), axis=1)
    df["FULLNAME"] = df["FIRSTNAME"] + " " + df["LASTNAME"]
    df["SCENARIO"] = pd.to_datetime(df["CREATED_DATE"].str[:19]).dt.strftime("W%W-%Y")
    
    # Create notes from interactions
    profiles = df["PROFILE_URL"].unique()
    notes, entities = get_metadata_by_profile(df_people, profiles)
    df["NOTES"] = df["PROFILE_URL"].map(notes)
    df["ENTITY"] = df["PROFILE_URL"].map(entities)
    
    # Cleaning: Sort values
    df = df.sort_values(by=["INTERACTION_SCORE", "LAST_INTERACTION_DATE"], ascending=[False, False])
    
    # Get meta data from existing people
    col_ref = [
        "PROFILE_URL",
        "ORGANIZATION",
        "SENIORITY",
        "DEPARTMENT",
        "CRM_CONTACT_ID",
    ]
    for c in col_ref:
        # If the column does not exist, init value to be determined (TBD)
        if c not in df_init.columns:
            df_init[c] = "TBD"
    ref = df_init[col_ref]

    # Merge to get meta data
    df = pd.merge(df, ref, how="left").fillna("TBD")
        
    # Cleaning
    to_order = [
        "ENTITY",
        "SCENARIO",
        "CREATED_DATE",
        "FIRSTNAME",
        "LASTNAME",
        "FULLNAME",
        "OCCUPATION",
        "SENIORITY",
        "DEPARTMENT",
        "ORGANIZATION",
        "INTERACTION_SCORE",
        "MQL_DATE",
        "SQL_DATE",
        "LAST_INTERACTION_DATE",
        "NOTES",
        "PROFILE_URL",
        "PUBLIC_ID",
        "CRM_CONTACT_ID"
    ]
    df = df[to_order]

    # Save database
    pdump(output_dir, df, "people_init")
    return df.reset_index(drop=True)

db_people = create_db_people(
    df_init,
    df_interactions,  
    output_dir,
)
print("- People:", len(db_people))
db_people.head(3)
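
To make the SQL date logic concrete, here is a minimal sketch on toy data. The URLs, dates and scores are made up for illustration. The threshold of 3 and the column names follow `get_sql_date` above. Unlike that function, the sketch skips the step that first sums scores sharing the same timestamp.

```python
import pandas as pd

# Toy interactions (hypothetical data): profile "a" reaches a cumulative score of 3 on its third day
toy = pd.DataFrame(
    {
        "PROFILE_URL": ["https://www.linkedin.com/in/a"] * 3 + ["https://www.linkedin.com/in/b"],
        "DATE_INTERACTION": [
            "2024-01-01 09:00:00",
            "2024-01-02 09:00:00",
            "2024-01-03 09:00:00",
            "2024-01-01 10:00:00",
        ],
        "INTERACTION_SCORE": [1, 1, 1, 2],
    }
)
toy["DATE_INTERACTION"] = pd.to_datetime(toy["DATE_INTERACTION"])
toy = toy.sort_values(["PROFILE_URL", "DATE_INTERACTION"])

# Running total of the score per profile, in chronological order
toy["INTERACTION_CUM"] = toy.groupby("PROFILE_URL")["INTERACTION_SCORE"].cumsum()

# First date on which each profile reaches the threshold of 3
sql_dates = (
    toy[toy["INTERACTION_CUM"] >= 3]
    .drop_duplicates("PROFILE_URL", keep="first")
    .set_index("PROFILE_URL")["DATE_INTERACTION"]
)
print(sql_dates)
```

Profile "b" never reaches the threshold, so it gets no SQL date. In the notebook, the left merge followed by `fillna("NA")` turns that gap into "NA".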

### Enrich people with company, seniority, department

In [None]:
def enrich_from_occupation(
    occupation,
    key,
    keys,
    api_key,
    prompt,
    file,
    output_dir,
):
    result = "NA"
    if key not in keys:
        result = create_chat_completion(api_key, prompt, occupation, completion="naas").replace("'", "").replace('"', '')
        keys[key] = result
    else:
        print("✅ Result already in dict")
        result = keys.get(key)
    pdump(output_dir, keys, file)
    return result

def enrich_people(
    df_init,
    api_key,
    prompt_seniority,
    prompt_department,
    prompt_organization,
    output_dir,
    datalake_dir,
    limit_linkedin=30
):
    # Init
    df = df_init.copy()
    
    # Filter data
    filter_df = df[
        (df["SENIORITY"].isin(["TBD"])) &
        (df["DEPARTMENT"].isin(["TBD"])) &
        (df["ORGANIZATION"].isin(["TBD", "NA"]))
    ]
    print("-> People to be updated:", len(filter_df))

    # Get people seniority
    people_seniority = get_dict_from_df(df, "SENIORITY", "PROFILE_URL", "people_seniority", output_dir)
    
    # Get people department
    people_department = get_dict_from_df(df, "DEPARTMENT", "PROFILE_URL", "people_department", output_dir)

    # Get companies
    people_organizations = get_dict_from_df(df, "ORGANIZATION", "PROFILE_URL", "people_organizations", output_dir)
    
    # Loop on profile
    count = 1
    call_linkedin = 0    
    for row in filter_df.itertuples():
        index = row.Index
        fullname = row.FULLNAME
        occupation = row.OCCUPATION
        profile_url = row.PROFILE_URL
        seniority = row.SENIORITY
        department = row.DEPARTMENT
        organization = row.ORGANIZATION
        interaction_score = row.INTERACTION_SCORE
        print(f"{count} - Starting with  '{fullname}' ({profile_url})")
        
        # Update ICP and Company from OpenAI
        if (seniority == "TBD" or department == "TBD" or organization == "TBD") and api_key != "NA" and (interaction_score >= 3 or count <= 100):
            print()
            print(f"🤖 Finding seniority, department & company from occupation: {occupation}")
            try:
                seniority = enrich_from_occupation(
                    occupation,
                    profile_url,
                    people_seniority,
                    api_key,
                    prompt_seniority,
                    "people_seniority",
                    output_dir
                )
                department = enrich_from_occupation(
                    occupation,
                    profile_url,
                    people_department,
                    api_key,
                    prompt_department,
                    "people_department",
                    output_dir
                )
                organization = enrich_from_occupation(
                    occupation,
                    profile_url,
                    people_organizations,
                    api_key,
                    prompt_organization,
                    "people_organizations",
                    output_dir
                )
            except Exception as e:
                print(e)
            df.loc[index, "SENIORITY"] = seniority.strip()
            df.loc[index, "DEPARTMENT"] = department.strip()
            df.loc[index, "ORGANIZATION"] = organization.strip()
            print("- Seniority:", seniority)
            print("- Department:", department)
            print("- Organization:", organization)
            
        # Update Company info
        if organization == "NA" and interaction_score >= 3 and call_linkedin < limit_linkedin:
            print(f"🕸️ Finding LinkedIn company (interaction score: {interaction_score})")
            company_name = "Not Found"
            linkedin_dir = os.path.join(datalake_dir, "datalake", "linkedin", "profiles")
            linkedin_id = profile_url.split("/")[-1]
            tmp_df = pload(linkedin_dir, f"{linkedin_id}_linkedin_top_card")
            if tmp_df is None:
                # Get Top Card
                try:
                    tmp_df = linkedin.connect(li_at, JSESSIONID).profile.get_top_card(profile_url)
                    pdump(linkedin_dir, tmp_df, f"{linkedin_id}_linkedin_top_card")
                    time.sleep(2)
                    call_linkedin += 1
                    print("- ⚠️ LinkedIn call:", call_linkedin)
                except Exception as e:
                    print(e)
                    organization = "ERROR_LINKEDIN_ENRICHMENT"
                    tmp_df = pd.DataFrame()
            if len(tmp_df) > 0:
                organization = tmp_df.loc[0, "COMPANY_NAME"]
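            # Map missing values to "Not Found" (exact match, so names containing "NA" stay intact)
            organization = str(organization).strip()
            if organization in ("None", "NA"):
                organization = "Not Found"
            df.loc[index, "ORGANIZATION"] = organization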
            print("- Organization:", organization)
            if call_linkedin >= limit_linkedin:
                print("🛑 LinkedIn call limit reached:", limit_linkedin)
        count += 1
        print()
    return df.reset_index(drop=True)
    
df_people = enrich_people(
    db_people,
    api_key,
    prompt_seniority,
    prompt_department,
    prompt_organization,
    output_dir,
    datalake_dir,
)  
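
The caching pattern behind `enrich_from_occupation` (look up the profile URL in a dict, and call the model only on a miss) can be sketched in isolation. In this sketch `cached_lookup` and `fake_llm` are hypothetical names: `fake_llm` stands in for the real LLM call, and the step that persists the dict to disk is left out.

```python
from typing import Callable, Dict


def cached_lookup(key: str, value: str, cache: Dict[str, str], fetch: Callable[[str], str]) -> str:
    """Return cache[key] if present; otherwise call fetch(value) once and store the result."""
    if key not in cache:
        # Strip quotes from the answer, as the notebook does with the LLM output
        cache[key] = fetch(value).replace("'", "").replace('"', "")
    return cache[key]


calls = []


def fake_llm(occupation: str) -> str:
    # Hypothetical stand-in for the LLM call; records each invocation
    calls.append(occupation)
    return '"Manager"'


cache: Dict[str, str] = {}
profile = "https://www.linkedin.com/in/a"
first = cached_lookup(profile, "Sales Manager at Acme", cache, fake_llm)
second = cached_lookup(profile, "Sales Manager at Acme", cache, fake_llm)
print(first, second, len(calls))  # Manager Manager 1
```

Keying the cache on the profile URL rather than on the occupation means a profile whose occupation changes keeps its cached value until the entry is removed.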

## Output

### Save data

In [None]:
pdump(output_dir, df_people, file_people)

### Send "People" to spreadsheet

In [None]:
df_check = pd.concat([df_init.astype(str), df_people.astype(str)]).drop_duplicates(keep=False)
if len(df_check) > 0:
    gsheet.connect(spreadsheet_url).send(data=df_people, sheet_name=sheet_people, append=False)
else:
    print("Nothing to update in Google Sheets!")
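
The change check above relies on `drop_duplicates(keep=False)`: after concatenating the old and new frames, any row present in both is dropped entirely, so only differing rows remain. A minimal sketch with made-up data follows (`has_changes` is a hypothetical helper name):

```python
import pandas as pd


def has_changes(old: pd.DataFrame, new: pd.DataFrame) -> bool:
    """Return True if any row differs between the two frames (compared as strings)."""
    diff = pd.concat([old.astype(str), new.astype(str)]).drop_duplicates(keep=False)
    return len(diff) > 0


before = pd.DataFrame({"PROFILE_URL": ["a", "b"], "SENIORITY": ["TBD", "Manager"]})
after = pd.DataFrame({"PROFILE_URL": ["a", "b"], "SENIORITY": ["Executive", "Manager"]})

print(has_changes(before, after))          # True: the row for "a" differs
print(has_changes(before, before.copy()))  # False: every row appears twice
```

One caveat of this approach: rows duplicated within a single frame are also dropped, so a change consisting only of such duplicates would go undetected.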