# Jiu-Jitsu in Canada
The goal of this project is to find jiu-jitsu academies, primarily across North America, analyze the collected information, and devise a multi-channel sales prospecting process for Keystone Kimonos.

To accomplish this goal, the idea is to build a dataframe of jiu-jitsu academies that can be saved to disk and loaded back into this Python script. The dataframe will hold items scraped from online pages with the following attributes: academy name, address, province, country, timings, phone, email, and website.

We can also employ data visualization tools, supported by text retrieval and search engine techniques, to rank the academies and derive knowledge from the data collected in the dataframe.

## Scraping First Layer of Data for the Dataframe
The objective of this section is to build the set of functions responsible for saving the dataframe to disk and loading it back into memory for computation. In addition, we extract the required information that can be scraped headless.
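
A minimal sketch of the save and load helpers described above, assuming the dataframe is persisted as a CSV file. The function names `save_academies` and `load_academies`, the `COLUMNS` labels, and the file path passed in are illustrative assumptions and not part of the original script.

```python
import pandas as pd

# Attributes taken from the project description; the exact column labels are an assumption
COLUMNS = ["academy_name", "address", "province", "country",
           "timings", "phone", "email", "website"]


def save_academies(df: pd.DataFrame, path: str) -> None:
    """Write the academies dataframe to a CSV file without the index column."""
    df.to_csv(path, encoding="utf-8", index=False)


def load_academies(path: str) -> pd.DataFrame:
    """Load the academies from a CSV file, or return an empty dataframe if the file is missing."""
    try:
        return pd.read_csv(path)
    except FileNotFoundError:
        return pd.DataFrame(columns=COLUMNS)
```

Returning an empty dataframe with the expected columns on the first run lets the rest of the script append rows without checking whether a file already exists.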

In [None]:
import pandas as pd
import requests
from bs4 import BeautifulSoup as bs

# Collect the Smoothcomp profile link of every North American club from the paginated listing
clubs = []

try:
    for page_number in range(0, 3000):
        # Request one listing page from Smoothcomp
        response = requests.get(f"https://smoothcomp.com/en/club?search=&country=&continent=North%20America&page={page_number}")
        soup = bs(response.text, 'html.parser')
        # Keep only the club links on this page (all other details are read from each club's own page later)
        for item in soup.find_all('a'):
            _class = item.get("class")
            if _class and "color-inherit" in _class:
                clubs.append(item.get('href'))
        pages_scraped = page_number + 1
        if pages_scraped % 10 == 0:
            print(f"\tScraped {pages_scraped} pages")
# Also catch KeyboardInterrupt so a manual stop keeps the links gathered so far
except (Exception, KeyboardInterrupt):
    print(f"Stopped/Crashed at page number: {page_number}.")

print(f"Acquired Smoothcomp profiles of {len(clubs)} academies.")

print("Extracting club info...")
data = []
try:
    # Scrape each club's Smoothcomp profile to collect the required attributes
    for count, club_url in enumerate(clubs, start=1):
        # One row per club; fields stay None when the profile does not list them
        row = {"Name": None, "Location": None, "Contact Persons": None, "Affiliation": None}
        response = requests.get(club_url)
        soup = bs(response.text, 'html.parser')
        row["Source"] = club_url
        # Note: removing every space also removes the spaces between the words of the name
        row["Name"] = soup.title.text.replace(" ", "").replace("\n", "").replace("-Smoothcomp", "")
        # Non-empty lines of the info panel; each label is followed by its value
        club_info = [line for line in soup.find_all(id="clubInfo")[0].text.split('\n') if line]
        for field in ("Location", "Contact Persons", "Affiliation"):
            if field in club_info:
                row[field] = club_info[club_info.index(field) + 1]
        # Collect the row so the list can be converted into a pandas dataframe
        data.append(row)
        if count % 10 == 0:
            print(f"\tScraped {count} clubs")
# Also catch KeyboardInterrupt so a manual stop still saves the rows gathered so far
except (Exception, KeyboardInterrupt):
    print(f"\tStopped after scraping {len(data)} clubs")

df = pd.DataFrame(data)
df.to_csv("ExtractedData.csv", encoding='utf-8', index=False)
print(f"\n\nSaved csv with data on {len(df)} academies")
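
The label/value lookup used in the cell above can be isolated into a small helper, which makes it easier to test without hitting the network. This is a sketch under assumptions: the function name `extract_fields` and the sample panel text are hypothetical, and unlike the cell above it strips whitespace and returns `None` when a label is the last line.

```python
from typing import Dict, Optional, Sequence


def extract_fields(lines: Sequence[str], labels: Sequence[str]) -> Dict[str, Optional[str]]:
    """Return the line following each label, or None when the label or its value is absent."""
    cleaned = [line.strip() for line in lines if line.strip()]
    fields: Dict[str, Optional[str]] = {}
    for label in labels:
        fields[label] = None
        if label in cleaned:
            value_index = cleaned.index(label) + 1
            if value_index < len(cleaned):
                fields[label] = cleaned[value_index]
    return fields


# Hypothetical panel text, shaped like the label/value lines the scraper expects
panel_text = "Location\nCanada\n\nContact Persons\nRodrigo Carvalho\n"
info = extract_fields(panel_text.split("\n"), ["Location", "Contact Persons", "Affiliation"])
print(info)
```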

## Filtering and Extracting Data on Canadian Academies

In [73]:
import pandas as pd
pd.set_option('display.max_rows', None)

# Load all the data scraped from Smoothcomp and stored in a CSV
clubs_df = pd.read_csv("ExtractedData.csv")
clubs_df = clubs_df.drop(columns="Affiliation")
# Drop duplicates
clubs_df = clubs_df.drop_duplicates()
# Keep Canadian academies; na=False treats a missing Location as "no match"
canadian_clubs_df = clubs_df[clubs_df["Location"].str.contains("Canada|ON|Ontario", na=False)].reset_index(drop=True)
# Store the filtered df to a CSV
canadian_clubs_df.to_csv("CanadianData2.csv")
print("Stored filtered CSV")
canadian_clubs_df

Stored filtered CSV


Unnamed: 0,Name,Location,Contact Persons,Source
0,GracieBarraVancouver,Canada,Rodrigo Carvalho,https://smoothcomp.com/en/club/6605
1,"BattleArtsAcademy,Mississauga","4880 Tomken Rd, Mississauga, ON L4W 1J8,",Brendon May,https://smoothcomp.com/en/club/21170
2,CurrentBJJ,"6800 Kitimat Rd, Unit 27, Mississauga ON, L5N ...",Hannah Brown,https://smoothcomp.com/en/club/47454
3,CarlsonGracieTeam,Canada,Mario Decosta,https://smoothcomp.com/en/club/7888
4,BurnabyVancouverJiu-Jitsu(BVJJ),"3738 Canada Way,",Eduardo Cadena,https://smoothcomp.com/en/club/14454
5,BOABJJMMA,Canada,,https://smoothcomp.com/en/club/7735
6,TeamRenzoGracieOttawa,Canada,Pat Cooligan,https://smoothcomp.com/en/club/12007
7,10thPlanetVancouver,Canada,Nabil Salameh,https://smoothcomp.com/en/club/24572
8,RevelationMartialArts,Canada,John Sawatzky,https://smoothcomp.com/en/club/6750
9,Lionsmma,Canada,,https://smoothcomp.com/en/club/7789
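
The pattern `"Canada|ON|Ontario"` matches those terms anywhere in the text, so `ON` also matches inside longer uppercase words. A possible refinement, sketched below with the standard `re` module, anchors the same terms on word boundaries. The names `CANADA_PATTERN` and `looks_canadian` are assumptions, and the second sample location is hypothetical.

```python
import re

# Same terms as the filter above, but matched only as whole words
CANADA_PATTERN = re.compile(r"\b(?:Canada|Ontario|ON)\b")


def looks_canadian(location: object) -> bool:
    """True when the location text contains one of the terms as a whole word."""
    return isinstance(location, str) and CANADA_PATTERN.search(location) is not None


print(looks_canadian("4880 Tomken Rd, Mississauga, ON L4W 1J8,"))  # True
print(looks_canadian("BOSTON, MA"))  # False: "ON" is only part of a longer word
print(looks_canadian(None))  # False: missing locations never match
```

The pattern string (`CANADA_PATTERN.pattern`) can be passed to `Series.str.contains` in place of the original one. Other provinces are still not covered; their names or two-letter codes would have to be added to the alternation.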


## Using Google Maps API for Finding Additional Information
This section attempts to collect additional metadata on each of these venues, such as a more precise address, reviews, and online presence, using the Google Maps API. It also lets me gain a basic understanding of how to work with the Google Maps API, which could carry over to other data science projects.

In [83]:
import os
import googlemaps

# Read the key from the environment instead of hard-coding it in the notebook
# (the variable name GOOGLE_MAPS_API_KEY is an assumption)
API_KEY = os.environ["GOOGLE_MAPS_API_KEY"]
COLS = ["formatted_address","name","opening_hours","business_status","geometry"]

# Initialize the gmaps client, then look up each club by name and collect additional information
gmaps = googlemaps.Client(key=API_KEY)
data = []
for club in canadian_clubs_df["Name"]:
    # Default row, used when the lookup fails
    row = {"formatted_address":None, "name": None, "opening_hours":None, "business_status":None, "lat":None,"lng":None}
    try:
        response = gmaps.find_place(input=club,
                                    input_type="textquery",
                                    fields=COLS)

        # Keep the first candidate (the best match for the given academy name)
        candidate_data = response["candidates"][0]
        # Flatten the nested geometry into lat/lng columns
        location = candidate_data.pop("geometry")["location"]
        candidate_data["lat"] = location["lat"]
        candidate_data["lng"] = location["lng"]
    except Exception:
        print(f"Failed to retrieve {club} data.")
        candidate_data = row
    data.append(candidate_data)

df = pd.DataFrame(data)
df.to_csv("gmapsData.csv", index=False)
print("Completed Scraping Meta Data Using Google Maps..")

Failed to retrieve Pound4Pound-ResoluteJiu-Jitsu data.
Failed to retrieve PacificTopTeamKamloops data.
Failed to retrieve 10thPlanetJiuJitsuMontreal data.
Failed to retrieve TITANSMMA data.
Failed to retrieve CJMartialArtsandFitness data.
Failed to retrieve BibianoFernandesAcademy data.
Failed to retrieve None data.
Failed to retrieve ArashiDoBEHRING data.
Failed to retrieve ARESBJJMEX data.
Failed to retrieve GracieBarraWhistler data.
Failed to retrieve BarrieGracieHuamita data.
Failed to retrieve YGK-BO4Kingston data.
Failed to retrieve BRAZZILIANWARRIORSAGUASCLIENTES data.
Failed to retrieve AmericanSubmissionWrestling data.
Failed to retrieve WulfrunCharlottetown data.
Failed to retrieve PacificTopTeamRichmond data.
Failed to retrieve MXTHALIFAX data.
Completed Scraping Meta Data Using Google Maps..
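
The club names were stored without spaces during scraping, which may make them harder for a text query to match. Whether that explains the failed lookups above is an assumption. The sketch below shows one way to restore word breaks before querying. The helper name `split_joined_name` is hypothetical.

```python
import re
from typing import Optional


def split_joined_name(name: object) -> Optional[str]:
    """Insert a space wherever a lowercase letter or digit is directly followed by an uppercase letter."""
    if not isinstance(name, str):
        return None  # missing names (None/NaN) cannot be queried
    return re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", name)


print(split_joined_name("GracieBarraVancouver"))    # Gracie Barra Vancouver
print(split_joined_name("PacificTopTeamKamloops"))  # Pacific Top Team Kamloops
```

This heuristic leaves all-caps names such as `TITANSMMA` unchanged and cannot separate lowercase words (the `and` in `CJMartialArtsandFitness`), so keeping the original spacing at scrape time would be more reliable.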


## Visualizing Data
We could further rank the academies based on the club statistics available on Smoothcomp. Beyond basic scraping, we could attempt to rank the pages using an algorithm (requires research) that scrapes all pages up to a certain rank, breaks down their text, and ranks them by the probability of brand alignment for sales pitch calls.
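
As a starting point for the ranking idea above, here is a minimal sketch that scores page text by the share of its words that are brand-related keywords. It is a simple stand-in for illustration, not the algorithm the project would finally use. The keyword list, the function names, and the sample page texts are all hypothetical.

```python
import re
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def alignment_score(text: str, keywords: Iterable[str]) -> float:
    """Fraction of tokens in the text that are brand keywords (0.0 for empty text)."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    hits = sum(counts[keyword.lower()] for keyword in keywords)
    return hits / len(tokens)


def rank_pages(pages: Dict[str, str], keywords: Iterable[str]) -> List[Tuple[str, float]]:
    """Return (page name, score) pairs sorted from most to least aligned."""
    keyword_list = list(keywords)
    scored = [(name, alignment_score(text, keyword_list)) for name, text in pages.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical keywords and page texts, for illustration only
KEYWORDS = ["gi", "kimono", "competition"]
pages = {
    "academy_a": "We sell gi and kimono gear for competition teams",
    "academy_b": "Kickboxing and fitness classes for all ages",
}
ranking = rank_pages(pages, KEYWORDS)
print(ranking)
```

A weighting scheme such as TF-IDF would be a natural next step, since it discounts words that appear on almost every academy page.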

In [97]:
# Load the gmaps data from csv
gmaps_df = pd.read_csv("gmapsData.csv")
# Combine the gmaps data with the Smoothcomp data column-wise (rows are matched by position)
academies = pd.concat([canadian_clubs_df.reset_index(drop=True),gmaps_df.reset_index(drop=True)], axis=1)
# Drop the Smoothcomp columns that repeat information held in the gmaps columns
academies = academies.drop(columns=["Location", "Name"])
academies.to_csv("data.csv")
academies

Unnamed: 0,Contact Persons,Source,business_status,formatted_address,name,opening_hours,lat,lng
0,Rodrigo Carvalho,https://smoothcomp.com/en/club/6605,OPERATIONAL,"2440 Main St, Vancouver, BC V5T 3E2, Canada",Gracie Barra Vancouver,{'open_now': False},49.263395,-123.100753
1,Brendon May,https://smoothcomp.com/en/club/21170,OPERATIONAL,"4880 Tomken Rd, Mississauga, ON L4W 1J8, Canada",Battle Arts Academy,{'open_now': False},43.625361,-79.63028
2,Hannah Brown,https://smoothcomp.com/en/club/47454,OPERATIONAL,"6800 Kitimat Rd #27, Mississauga, ON L5N 5M1, ...",Current Jiu Jitsu,{'open_now': False},43.60833,-79.73685
3,Mario Decosta,https://smoothcomp.com/en/club/7888,OPERATIONAL,"76 Progress Dr ste 238, Stamford, CT 06902, Un...",Carlson Gracie Team Stamford - CT,{'open_now': False},41.053566,-73.561882
4,Eduardo Cadena,https://smoothcomp.com/en/club/14454,OPERATIONAL,"3738 Canada Wy, Burnaby, BC V5G 1G5, Canada",Burnaby Vancouver Jiu-Jitsu,{'open_now': False},49.254423,-123.021981
5,,https://smoothcomp.com/en/club/7735,OPERATIONAL,"401 N Warren St, Orwigsburg, PA 17961, United ...",Boa Brazilian Jiu Jitsu LLC,{'open_now': False},40.657922,-76.099784
6,Pat Cooligan,https://smoothcomp.com/en/club/12007,OPERATIONAL,"432 Tralee Rd, Nepean, ON K2J 5Z7, Canada",The Resource Team,{'open_now': False},45.286558,-75.735299
7,Nabil Salameh,https://smoothcomp.com/en/club/24572,OPERATIONAL,"9811 NE 15th Ave UNIT 101, Vancouver, WA 98665...",10th Planet Vancouver,{'open_now': False},45.692499,-122.65627
8,John Sawatzky,https://smoothcomp.com/en/club/6750,OPERATIONAL,"1185 California Ave #5, Brockville, ON K6V 7N5...",Revelation Martial Arts and Fitness,{'open_now': False},44.613807,-75.693591
9,,https://smoothcomp.com/en/club/7789,OPERATIONAL,"1256 Granville St #1, Vancouver, BC V6Z 1M4, C...",Lions MMA,{'open_now': False},49.276483,-123.127025
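
In the merged table above, several lookups resolved to United States addresses (for example Stamford, CT and Vancouver, WA), because the query used only the club name. A minimal sketch for flagging such rows for manual review is shown below. It assumes the full `formatted_address` ends with the country name. The function name `flag_non_canadian` and the second sample row are hypothetical.

```python
import pandas as pd


def flag_non_canadian(df: pd.DataFrame) -> pd.DataFrame:
    """Return the rows whose Google Maps address does not end with 'Canada'."""
    address = df["formatted_address"].fillna("").str.strip()
    return df[~address.str.endswith("Canada")]


sample = pd.DataFrame({
    "name": ["Gracie Barra Vancouver", "Hypothetical Academy"],
    "formatted_address": [
        "2440 Main St, Vancouver, BC V5T 3E2, Canada",
        "1 Example St, Seattle, WA 98101, United States",  # hypothetical address
    ],
})
suspects = flag_non_canadian(sample)
print(suspects["name"].tolist())
```

Rows with a missing address (failed lookups) are flagged as well, which is usually what is wanted when reviewing the matches by hand.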
