# github-topic-scrapping


# Pick a website and describe your objective

- Browse through different sites and pick one to scrape. Check the "Project Ideas" section for inspiration.
- Identify the information you'd like to scrape from the site. Decide the format of the output CSV file.
- Summarize your project idea and outline your strategy in a Jupyter notebook. Use the "New" button above.


## Steps we will follow to scrape the GitHub topics page

- We're going to scrape https://github.com/topics
- We'll get a list of topics. For each topic, we'll get the topic title, topic page URL, and topic description.
- For each topic, we'll get the 25 repositories in the topic from the topic page.
- For each repository, we'll grab the repo name, username, stars, and repo URL.
- For each topic, we'll create a CSV file.

## We'll create a function to get the topics from the topics page

- First, we import all the required libraries, such as pandas, BeautifulSoup, and requests.

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import os

In [2]:
# Function to download the topics page and return it in parsed form
def topic_scrapper():
    url = 'https://github.com/topics'
    response = requests.get(url)
    # response.text holds the page source as a string
    page_content = response.text

    # parsed_doc contains the whole page as a parsed HTML tree
    parsed_doc = BeautifulSoup(page_content, "html.parser")
    return parsed_doc


In [3]:
doc=topic_scrapper()

### How to get topic names

- To get the topic names, we have to find the 'p' tags with class...

![topic_image.png](attachment:topic_image.png)

In [4]:
#function to get the names of all the topics
def get_topic_name(parsed_doc):
    topic_name = parsed_doc.find_all('p',{'class': 'f3 lh-condensed mb-0 mt-1 Link--primary'})
    #topic name list
    topic_names=[]
    for n in topic_name:
        topic_names.append(n.text)
    return topic_names

In [5]:
# Function to get the topic descriptions
def get_topic_desc(parsed_doc):
    topic_description = parsed_doc.find_all('p',{'class': 'f5 color-fg-muted mb-0 mt-1'})
    #topic description list    
    topic_descp=[]
    for d in topic_description:
        topic_descp.append(d.text.strip())
    return topic_descp

In [6]:
#defining a function to get the topic page links
def get_topic_links(parsed_doc):
    topic_link = parsed_doc.find_all('a',{'class':'no-underline flex-1 d-flex flex-column'})
    #topic link list
    topic_links = []
    for l in topic_link:
        topic_links.append("https://github.com" + l["href"])
    return topic_links
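
The string concatenation in `get_topic_links` assumes every `href` is a root-relative path such as `/topics/3d`. A minimal sketch of a more tolerant alternative uses `urljoin` from the standard library, which also leaves already-absolute URLs unchanged. The helper name `to_absolute` and the sample paths are illustrative assumptions, not part of the original notebook.

```python
from urllib.parse import urljoin

BASE_URL = "https://github.com"


def to_absolute(href, base=BASE_URL):
    """Resolve an href against the site root (hypothetical helper)."""
    return urljoin(base, href)


# Sample hrefs made up for illustration
print(to_absolute("/topics/3d"))
print(to_absolute("https://github.com/topics/ajax"))
```

Inside `get_topic_links`, the append line could then call `to_absolute(l["href"])` instead of concatenating strings.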

### Now we define a function which:
- calls all the functions defined above
- gets details from them, such as the topic name
- builds a dictionary from those details and returns a DataFrame of all of them

In [7]:
def scrap_topics():
    url = 'https://github.com/topics'
    response = requests.get(url)

    # Check that the page loaded successfully
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(url))
    # response.text holds the page source as a string
    page_content = response.text
    # parsed_doc contains the whole page as a parsed HTML tree
    parsed_doc = BeautifulSoup(page_content, "html.parser")

    topics_dictionary = {
        "topic_name": get_topic_name(parsed_doc),
        "topic_description": get_topic_desc(parsed_doc),
        "topic_link": get_topic_links(parsed_doc)
    }
    return pd.DataFrame(topics_dictionary)
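
`pd.DataFrame` raises a `ValueError` when the lists in the dictionary have different lengths, which could happen if one of the class names stops matching some of the tags. A minimal sketch of a pre-check in plain Python is shown here. The helper name `check_equal_lengths` and the sample values are assumptions made for illustration.

```python
def check_equal_lengths(columns):
    """Raise ValueError if the lists in `columns` differ in length (hypothetical helper)."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError("Column lengths differ: {}".format(lengths))
    return columns


# Placeholder data, not scraped output
sample = {
    "topic_name": ["3D", "Ajax"],
    "topic_description": ["first description", "second description"],
    "topic_link": ["https://github.com/topics/3d", "https://github.com/topics/ajax"],
}
check_equal_lengths(sample)
```

Calling it on `topics_dictionary` before building the DataFrame would report which column came up short instead of a generic pandas error.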

### Now we will get the repos from a topic page

- We'll get the username, repo name, stars, and repo URL from each topic page.

In [8]:
# Function to download and parse a single topic page
def get_topic_pages(topic_url):
    # Load the page
    response = requests.get(topic_url)
    # Check that the page loaded successfully
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topic_url))
    repo_page = response.text

    # Parse the page
    repo_parsed = BeautifulSoup(repo_page, 'html.parser')
    return repo_parsed

### The username and repo name tags don't have any class, so:
we have to find their parent tag by its class and extract the details from there.

![username_repo.png](attachment:username_repo.png)


    

### After getting the username and repo details, we'll get the star count from a span tag

![stars.png](attachment:stars.png)


In [9]:
# Function that takes the parent tag of one repository and its star-count tag,
# and returns the username, repo name, repo link,
# and star count of that repository
def get_info(parent_tag,star_getter):
    baselink = 'https://github.com'
    child_tag = parent_tag.find_all('a')
    username = child_tag[0].text.strip()
    repo_name = child_tag[1].text.strip()
    repo_link = baselink + child_tag[1]['href']
    stars = star_counter(star_getter.text.strip())
    return username,repo_name,repo_link,stars

# Function that converts a star count string such as '94.1k' to an integer
def star_counter(star_str):
    star_str = star_str.strip()
    if star_str[-1] == 'k':
        return int(float(star_str[:-1]) * 1000)
    return int(star_str)
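
`star_counter` handles plain integers and a trailing `k`. A hedged sketch of a slightly more tolerant variant follows. Treating a trailing `m` as millions and stripping thousands separators are assumptions about how counts might be displayed, and `parse_count` is a hypothetical name. It uses `round` rather than `int` so that a floating-point product just below a whole number is not truncated.

```python
def parse_count(text):
    """Convert strings such as '142k', '1.5m' or '1,234' to an int (illustrative)."""
    multipliers = {"k": 1_000, "m": 1_000_000}  # the 'm' suffix is an assumption
    cleaned = text.strip().lower().replace(",", "")
    if not cleaned:
        raise ValueError("empty count string")
    suffix = cleaned[-1]
    if suffix in multipliers:
        # round() guards against results like 2299.9999999 being truncated
        return round(float(cleaned[:-1]) * multipliers[suffix])
    return int(cleaned)
```

It could stand in for `star_counter` inside `get_info` without changing the rest of the code.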

In [10]:
def get_topic_repos(repo_parsed):
    # The username and repo links have no class of their own,
    # so we select their parent h3 tags
    parent_tag = repo_parsed.find_all('h3',
            {'class': 'f3 color-fg-muted text-normal lh-condensed'})
    # Star-count tag for each repo
    star_getter = repo_parsed.find_all('span',
            {'class': 'Counter js-social-count'})

    # Dictionary of lists that is converted to a DataFrame at the end
    repo_page_dict = {
        'Username': [],
        'Repo_name': [],
        'Repo_link': [],
        'Stars': []
    }

    for i in range(len(parent_tag)):
        repo_info = get_info(parent_tag[i], star_getter[i])
        repo_page_dict['Username'].append(repo_info[0])
        repo_page_dict['Repo_name'].append(repo_info[1])
        repo_page_dict['Repo_link'].append(repo_info[2])
        repo_page_dict['Stars'].append(repo_info[3])

    return pd.DataFrame(repo_page_dict)

In [11]:
doc = get_topic_repos(get_topic_pages('https://github.com/topics/Android'))

In [12]:
doc.head()

| | Username | Repo_name | Repo_link | Stars |
|---|---|---|---|---|
| 0 | flutter | flutter | https://github.com/flutter/flutter | 142000 |
| 1 | justjavac | free-programming-books-zh_CN | https://github.com/justjavac/free-programming-... | 94100 |
| 2 | Genymobile | scrcpy | https://github.com/Genymobile/scrcpy | 67300 |
| 3 | Hack-with-Github | Awesome-Hacking | https://github.com/Hack-with-Github/Awesome-Ha... | 52800 |
| 4 | google | material-design-icons | https://github.com/google/material-design-icons | 46100 |


# Putting it all together by:
- defining a function which will call all the above functions
- getting all the data
- creating a folder named CSV
- creating the data frames for all the topics
- storing all the CSV files in that folder

In [15]:
# Function that checks whether the file already exists;
# if it does not, it scrapes the topic and stores the CSV inside the folder
def scrap_topic(topic_url,path):
    if os.path.exists(path):
        print('The file {} already exists. Skipping ...'.format(path))
        return
    topic_repo = get_topic_repos(get_topic_pages(topic_url))
    topic_repo.to_csv(path,index = None)


# Function that scrapes the list of topics and saves one CSV file per topic in the CSV folder
def scrap_topic_repos():
    print('Scrapping list of topics')
    topics_df = scrap_topics()
    
    os.makedirs('CSV',exist_ok = True)
    
    for index,row in topics_df.iterrows():
        print('Scrapping top repositories for "{}"'.format(row['topic_name']))
        scrap_topic(row['topic_link'],'CSV/{}.csv'.format(row['topic_name']))
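
Topic names are used directly as file names in `scrap_topic_repos`. That works for names like `ASP.NET`, but a name containing a path separator or other unusual characters could fail or point outside the `CSV` folder. A minimal sketch of a sanitizing helper is given here. `safe_filename` is a hypothetical name, and replacing disallowed characters with `_` is an assumed convention rather than anything the original notebook does.

```python
import re


def safe_filename(topic_name):
    """Return a file-system-friendly version of a topic name (hypothetical helper)."""
    # Keep letters, digits, dots, underscores, plus, hash and hyphen;
    # replace every other run of characters with a single underscore.
    cleaned = re.sub(r"[^A-Za-z0-9._+#-]+", "_", topic_name.strip())
    # Fall back to a fixed name if nothing usable is left
    return cleaned.strip("_") or "untitled"
```

The path passed to `scrap_topic` could then be built as `'CSV/{}.csv'.format(safe_filename(row['topic_name']))`.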

### Let's run it to scrape the top repos for all the topics on the first page of https://github.com/topics

In [16]:
scrap_topic_repos()

Scrapping list of topics
Scrapping top repositories for "3D"
The file CSV/3D.csv already exists. Skipping ...
Scrapping top repositories for "Ajax"
The file CSV/Ajax.csv already exists. Skipping ...
Scrapping top repositories for "Algorithm"
The file CSV/Algorithm.csv already exists. Skipping ...
Scrapping top repositories for "Amp"
The file CSV/Amp.csv already exists. Skipping ...
Scrapping top repositories for "Android"
The file CSV/Android.csv already exists. Skipping ...
Scrapping top repositories for "Angular"
The file CSV/Angular.csv already exists. Skipping ...
Scrapping top repositories for "Ansible"
The file CSV/Ansible.csv already exists. Skipping ...
Scrapping top repositories for "API"
The file CSV/API.csv already exists. Skipping ...
Scrapping top repositories for "Arduino"
The file CSV/Arduino.csv already exists. Skipping ...
Scrapping top repositories for "ASP.NET"
The file CSV/ASP.NET.csv already exists. Skipping ...
Scrapping top repositories for "Atom"
The file CSV/At

Now let's check one of the CSV files we created to make sure the function is writing them correctly.

In [18]:
pd.read_csv('CSV/Android.csv')

| | Username | Repo_name | Repo_link | Stars |
|---|---|---|---|---|
| 0 | flutter | flutter | https://github.com/flutter/flutter | 142000 |
| 1 | justjavac | free-programming-books-zh_CN | https://github.com/justjavac/free-programming-... | 94100 |
| 2 | Genymobile | scrcpy | https://github.com/Genymobile/scrcpy | 67300 |
| 3 | Hack-with-Github | Awesome-Hacking | https://github.com/Hack-with-Github/Awesome-Ha... | 52800 |
| 4 | google | material-design-icons | https://github.com/google/material-design-icons | 46100 |
| 5 | wasabeef | awesome-android-ui | https://github.com/wasabeef/awesome-android-ui | 43000 |
| 6 | square | okhttp | https://github.com/square/okhttp | 42400 |
| 7 | Solido | awesome-flutter | https://github.com/Solido/awesome-flutter | 41400 |
| 8 | android | architecture-samples | https://github.com/android/architecture-samples | 41100 |
| 9 | square | retrofit | https://github.com/square/retrofit | 40200 |
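
To check every generated file rather than a single one, the header row of each CSV can be compared against the expected column names. A small sketch using only the standard library is shown here. The function name `find_bad_csvs` is hypothetical, while the folder name `CSV` and the column names are taken from the code above.

```python
import csv
from pathlib import Path

EXPECTED_COLUMNS = ["Username", "Repo_name", "Repo_link", "Stars"]


def find_bad_csvs(folder="CSV"):
    """Return the names of CSV files whose header differs from EXPECTED_COLUMNS."""
    bad = []
    for path in sorted(Path(folder).glob("*.csv")):
        with open(path, newline="", encoding="utf-8") as handle:
            # The first row is the header; an empty file yields an empty list
            header = next(csv.reader(handle), [])
        if header != EXPECTED_COLUMNS:
            bad.append(path.name)
    return bad
```

An empty list from `find_bad_csvs()` would indicate that every file in the folder starts with the expected header.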


## References and future works

Summary:

In this project we divided the work into 3 phases:

1. We make a list of all the topic names, topic descriptions, and topic URLs, and form a dataset using the pandas library.
2. We make a list of the top repos for a topic, with the username, repo link, and star count of each repo, and form a dataset using the pandas library.
3. In the final stage we combine all the above work and write a function that traverses all the topic links and makes a CSV file for each topic, containing the username, repo name, repo link, and star count.

What we did to achieve the 1st stage:

1. We got the topics page URL.
2. We fetched the page as text using the requests.get() function.
3. We parsed the text as HTML with the help of BeautifulSoup.
4. After parsing the page, we went through it to find the tags that hold the required details, then extracted the details from those tags using the find_all function together with the class names of the tags.
5. After extracting the details, we made a list for each kind of detail.
6. From those lists, we formed a dictionary of lists.
7. From that dictionary, we made the DataFrame using pandas.
     
References:

- https://github.com
- https://github.com/topics
- https://www.crummy.com/software/BeautifulSoup/bs4/doc/