## Overview
This assessment involves analyzing the Calgary HTTP dataset, which contains approximately one year's worth of HTTP requests to the University of Calgary's Computer Science web server. You'll work with real-world web server log data to extract meaningful insights and demonstrate your Python data analysis skills.

## Part 1: Data Loading and Cleaning

In [1]:
# You can write your code here for data loading, cleaning, and exploration. Add cells as necessary.
# importing all libraries
import gzip
import re
import pandas as pd
from collections import Counter,defaultdict
from datetime import datetime

# creating the empty list for appending the split data ( remotehost, rfc931, authuser, date, request, status, bytes)
log = []

# creating the required pattern using Regex
pattern = r'(\S+) (\S+) (\S+) \[(.*?)\] "(.*?)" (\d{3}) (\S+)'

# reading the gzip-compressed log as text; the raw string keeps the backslashes in the Windows path literal
with gzip.open(r"F:\Data Analyst\MapUp\calgary_access_log.gz", 'rt', encoding='utf-8', errors='ignore') as document:
    for l in document:
        exact = re.match(pattern, l)
        if exact:
            # unpack the seven fields captured by the regex
            remotehost, rfc931, authuser, date, request, status, byte = exact.groups()

            # split the request line into method, resource and protocol;
            # requests that do not have exactly three parts get None values
            try:
                method, resource, protocol = request.split()
            except ValueError:
                method, resource, protocol = None, None, None
            
            # parse the timestamp, e.g. 24/Oct/1994:13:41:41 -0600, into a timezone-aware datetime
            try:
                date_time = datetime.strptime(date, '%d/%b/%Y:%H:%M:%S %z')
            except ValueError:
                date_time = None
            
            # converting byte value into int.
            byte_int = int(byte) if byte != '-' else 0
            
            #adding this modified data into the log list.
            log.append(
                {
                    'remotehost' : remotehost,
                    'rfc931' : rfc931,
                    'authuser' : authuser,
                    'date' : date_time,
                    'resource' : resource,
                    'status' : status,
                    'bytes' : byte_int
                }
            )
            
# now creating the dataframe to perform cleaning and manipulating operations
df = pd.DataFrame(log)
print(df.head())

print('\n',df['remotehost'].count())


  remotehost rfc931 authuser                       date    resource status  \
0      local      -        -  1994-10-24 13:41:41-06:00  index.html    200   
1      local      -        -  1994-10-24 13:41:41-06:00       1.gif    200   
2      local      -        -  1994-10-24 13:43:13-06:00  index.html    200   
3      local      -        -  1994-10-24 13:43:14-06:00       2.gif    200   
4      local      -        -  1994-10-24 13:43:15-06:00       3.gif    200   

   bytes  
0    150  
1   1210  
2   3185  
3   2555  
4  36403  

 724910
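
The parsing cell keeps a resource only when the request splits into exactly three tokens; anything else becomes `None` and is removed by the later `dropna`. That may account for part of the gap between the 724910 rows parsed here and the 431879 rows reported in Answer 1. The sketch below applies the same pattern to one line and shows a more tolerant split. The sample line is built for illustration from the first output row (the `GET` method and `HTTP/1.0` protocol are assumptions), and `split_request` is a hypothetical helper, not part of the notebook.

```python
import re
from datetime import datetime

# Same pattern as the loading cell, compiled once for reuse
PATTERN = re.compile(r'(\S+) (\S+) (\S+) \[(.*?)\] "(.*?)" (\d{3}) (\S+)')


def split_request(request):
    """Hypothetical helper: accept requests that omit the protocol token."""
    parts = request.split()
    if len(parts) == 3:
        return parts[0], parts[1], parts[2]
    if len(parts) == 2:
        # e.g. "GET index.html" with no protocol version
        return parts[0], parts[1], None
    return None, None, None


# Illustrative line; method and protocol are assumed, not taken from the log
sample = 'local - - [24/Oct/1994:13:41:41 -0600] "GET index.html HTTP/1.0" 200 150'

match = PATTERN.match(sample)
host, ident, user, date, request, status, size = match.groups()
method, resource, protocol = split_request(request)
timestamp = datetime.strptime(date, '%d/%b/%Y:%H:%M:%S %z')

print(host, resource, status, size, timestamp.isoformat())
# local index.html 200 150 1994-10-24T13:41:41-06:00
```

Whether two-token requests should count toward the answers depends on the assessment's definition of a valid record, so treat this as one option rather than the required behavior.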


In [2]:
#print(df.dtypes)
# changing the datatype of date from object to datetime (unparseable values become NaT) and status to int
df['date'] = pd.to_datetime(df['date'],format="%d/%b/%Y:%H:%M:%S %z",errors='coerce')
df['status'] = df['status'].astype(int)
print(df.dtypes,'\n')

# dropping the rows where either the date or the resource is missing
df.dropna(subset=['date','resource'],inplace = True)
#print(df['date'].count())

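# adding the helper columns the questions need: formatted date, hour of day and file extension
df['date_str'] = df['date'].dt.strftime('%d-%b-%Y')
df['hour'] = df['date'].dt.hour
# the extension is whatever follows the last '.'; a resource without a dot is kept whole
df['extension'] = df['resource'].apply(lambda x: x.split('.')[-1])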
print(df.head())
print('\n','count: ',df['hour'].count())

remotehost                       object
rfc931                           object
authuser                         object
date          datetime64[ns, UTC-06:00]
resource                         object
status                            int32
bytes                             int64
dtype: object 

  remotehost rfc931 authuser                      date    resource  status  \
0      local      -        - 1994-10-24 13:41:41-06:00  index.html     200   
1      local      -        - 1994-10-24 13:41:41-06:00       1.gif     200   
2      local      -        - 1994-10-24 13:43:13-06:00  index.html     200   
3      local      -        - 1994-10-24 13:43:14-06:00       2.gif     200   
4      local      -        - 1994-10-24 13:43:15-06:00       3.gif     200   

   bytes     date_str  hour extension  
0    150  24-Oct-1994    13      html  
1   1210  24-Oct-1994    13       gif  
2   3185  24-Oct-1994    13      html  
3   2555  24-Oct-1994    13       gif  
4  36403  24-Oct-1994    13       g

## Part 2: Analysis Questions

### Q1: Count of total log records

In [3]:
def total_log_records(dataframe) -> int:
    """
    Q1: Count of total log records.

    Objective:
        Determine the total number of HTTP log entries in the dataset.
        Each line in the log file represents one HTTP request.

    Returns:
        int: Total number of log entries.
    """

    # Count the rows of the cleaned DataFrame; rows with a missing date or resource were dropped earlier

    return len(dataframe)


answer1 = total_log_records(df)
print("Answer 1:")
print(answer1)

Answer 1:
431879


### Q2: Count of unique hosts

In [4]:
def unique_host_count(dataframe) -> int:
    """
    Q2: Count of unique hosts.

    Objective:
        Determine how many distinct hosts accessed the server.

    Returns:
        int: Number of unique hosts.
    """

    # Collect the distinct values of the remotehost column and count them
    hosts = dataframe['remotehost'].unique()
    uni_host = len(hosts)

    return uni_host


answer2 = unique_host_count(df)
print("Answer 2:")
print(answer2)

Answer 2:
2


### Q3: Date-wise unique filename counts

In [5]:
def datewise_unique_filename_counts(dataframe) -> dict[str, int]:
    """
    Q3: Date-wise unique filename counts.

    Objective:
        For each date, count the number of unique filenames requested from the server.
        The date should be in 'dd-MMM-yyyy' format (e.g., '01-Jul-1995').

    Returns:
        dict: A dictionary mapping each date to its count of unique filenames.
              Example: {'01-Jul-1995': 123, '02-Jul-1995': 150}
    """

    # Group by the formatted date and count the distinct resources requested on each day
    dict_uni_filenames = dataframe.groupby('date_str')['resource'].nunique().to_dict()

    return dict_uni_filenames


answer3 = datewise_unique_filename_counts(df)
print("Answer 3:")
print(answer3)

Answer 3:
{'01-Aug-1995': 669, '01-Jul-1995': 387, '01-Jun-1995': 590, '01-May-1995': 467, '01-Oct-1995': 552, '01-Sep-1995': 328, '02-Apr-1995': 438, '02-Aug-1995': 855, '02-Jul-1995': 397, '02-Jun-1995': 513, '02-May-1995': 701, '02-Oct-1995': 871, '02-Sep-1995': 349, '03-Apr-1995': 795, '03-Aug-1995': 582, '03-Jul-1995': 433, '03-Jun-1995': 398, '03-May-1995': 589, '03-Oct-1995': 846, '03-Sep-1995': 212, '04-Apr-1995': 821, '04-Aug-1995': 715, '04-Jul-1995': 610, '04-Jun-1995': 353, '04-May-1995': 684, '04-Oct-1995': 889, '04-Sep-1995': 340, '05-Apr-1995': 891, '05-Aug-1995': 507, '05-Jul-1995': 607, '05-Jun-1995': 494, '05-May-1995': 609, '05-Oct-1995': 846, '05-Sep-1995': 411, '06-Apr-1995': 678, '06-Aug-1995': 448, '06-Jul-1995': 522, '06-Jun-1995': 662, '06-May-1995': 517, '06-Oct-1995': 868, '06-Sep-1995': 549, '07-Apr-1995': 776, '07-Aug-1995': 608, '07-Jul-1995': 428, '07-Jun-1995': 486, '07-May-1995': 725, '07-Oct-1995': 468, '07-Sep-1995': 590, '08-Apr-1995': 542, '08-Aug-1
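
The keys in Answer 3 are strings, so `groupby` orders them alphabetically ('01-Aug-1995' before '01-Jul-1995') rather than by calendar date. If a chronological view is wanted, a minimal sketch is to re-sort the dictionary by the parsed date. `sort_by_date` is a hypothetical helper name, the three sample entries are copied from the output above, and `%b` assumes English month abbreviations in the active locale.

```python
from datetime import datetime


def sort_by_date(counts, fmt='%d-%b-%Y'):
    """Return a new dict whose keys are ordered chronologically."""
    return dict(sorted(counts.items(),
                       key=lambda item: datetime.strptime(item[0], fmt)))


# Three entries taken from the Answer 3 output
sample = {'01-Aug-1995': 669, '01-Jul-1995': 387, '01-Jun-1995': 590}

print(sort_by_date(sample))
# {'01-Jun-1995': 590, '01-Jul-1995': 387, '01-Aug-1995': 669}
```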

### Q4: Number of 404 response codes

In [6]:
def count_404_errors(dataframe) -> int:
    """
    Q4: Number of 404 response codes.

    Objective:
        Count how many times the HTTP 404 Not Found status appears in the logs.

    Returns:
        int: Number of 404 errors.
    """

    # Keep only the rows with status 404 and count them
    error404 = len(dataframe[dataframe['status'] == 404])

    return error404


answer4 = count_404_errors(df)
print("Answer 4:")
print(answer4)

Answer 4:
14586


### Q5: Top 15 filenames with 404 responses

In [7]:
def top_15_filenames_with_404(dataframe) -> list[tuple[str, int]]:
    """
    Q5: Top 15 filenames with 404 responses.

    Objective:
        Identify which requested URLs most frequently resulted in a 404 error.
        Return the top 15 filenames sorted by frequency.

    Returns:
        list: A list of tuples (filename, count), sorted by count in descending order.
              Example: [('index.html', 200), ...]
    """

    # Count the resources among the 404 rows; value_counts sorts by frequency, descending
    top_15 = dataframe[dataframe['status'] == 404]['resource'].value_counts().head(15)
    list_top_15 = list(top_15.items())

    return list_top_15


answer5 = top_15_filenames_with_404(df)
print("Answer 5:")
print(answer5)

Answer 5:
[('index.html', 3119), ('4115.html', 901), ('1611.html', 647), ('5698.xbm', 500), ('710.txt', 254), ('10695.ps', 161), ('6555.html', 153), ('9678.gif', 142), ('3268.gif', 138), ('9814.gif', 134), ('11059.gif', 129), ('11060.gif', 129), ('9388.xbm', 120), ('151.html', 119), ('1685.html', 113)]


### Q6: Top 15 file extensions with 404 responses

In [8]:
def top_15_ext_with_404(dataframe) -> list[tuple[str, int]]:
    """
    Q6: Top 15 file extensions with 404 responses.

    Objective:
        Find which file extensions generated the most 404 errors.
        Return the top 15 sorted by number of 404s.

    Returns:
        list: A list of tuples (extension, count), sorted by count in descending order.
              Example: [('html', 45), ...]
    """

    # Count the extensions among the 404 rows; value_counts sorts by frequency, descending
    top_15 = dataframe[dataframe['status'] == 404]['extension'].value_counts().head(15)
    list_top_15_ext = list(top_15.items())

    return list_top_15_ext


answer6 = top_15_ext_with_404(df)
print("Answer 6:")
print(answer6)

Answer 6:
[('html', 8051), ('gif', 4013), ('xbm', 665), ('ps', 562), ('txt', 265), ('jpg', 200), ('cgi', 76), ('GIF', 42), ('htm', 40), ('gif"', 34), ('com', 29), ('com/', 24), ('dvi', 23), ('rgb', 21), ('html/', 21)]
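
Answer 6 lists `gif`, `GIF` and `gif"` as separate extensions, along with entries such as `com/` and `html/`, because the extension column is simply the text after the last dot. A minimal sketch of one way to normalize these is shown below. `normalize_extension` is a hypothetical helper and the sample resource names are made up; applying it would change the counts above, so whether to merge these variants is a judgment call.

```python
import posixpath


def normalize_extension(resource):
    """Hypothetical helper: lower-cased extension without stray quotes or slashes."""
    cleaned = resource.strip().strip('"').rstrip('/')
    # splitext returns '' when the last path component has no dot
    extension = posixpath.splitext(cleaned)[1]
    return extension.lstrip('.').lower()


# Made-up resource names that mimic the variants seen in Answer 6
for name in ['a.GIF', 'b.gif"', 'c.html/', 'README']:
    print(repr(name), '->', repr(normalize_extension(name)))
```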


### Q7: Total bandwidth transferred per day for the month of July 1995

In [9]:
def total_bandwidth_per_day(dataframe) -> dict[str, int]:
    """
    Q7: Total bandwidth transferred per day for the month of July 1995.

    Objective:
        Sum the number of bytes transferred per day.
        Skip entries where the byte field is missing or '-'.

    Returns:
        dict: A dictionary mapping each date to total bytes transferred.
              Example: {'01-Jul-1995': 123456789, ...}
    """

    # Keep only July 1995, then sum the bytes per day ('-' entries were stored as 0 when loading)
    in_july_1995 = (dataframe['date'].dt.year == 1995) & (dataframe['date'].dt.month == 7)
    july_1995_byte = dataframe[in_july_1995].groupby('date_str')['bytes'].sum().to_dict()
    return july_1995_byte


answer7 = total_bandwidth_per_day(df)
print("Answer 7:")
print(answer7)

Answer 7:
{'01-Jul-1995': 11333976, '02-Jul-1995': 8653986, '03-Jul-1995': 13508529, '04-Jul-1995': 26565884, '05-Jul-1995': 19541225, '06-Jul-1995': 19752989, '07-Jul-1995': 9427822, '08-Jul-1995': 5403491, '09-Jul-1995': 4660556, '10-Jul-1995': 14912796, '11-Jul-1995': 22503471, '12-Jul-1995': 17365039, '13-Jul-1995': 15986302, '14-Jul-1995': 19184404, '15-Jul-1995': 15769181, '16-Jul-1995': 9005564, '17-Jul-1995': 19596435, '18-Jul-1995': 17096829, '19-Jul-1995': 17847673, '20-Jul-1995': 20751717, '21-Jul-1995': 25455607, '22-Jul-1995': 8059932, '23-Jul-1995': 9577795, '24-Jul-1995': 22298075, '25-Jul-1995': 24472760, '26-Jul-1995': 24564950, '27-Jul-1995': 25967969, '28-Jul-1995': 36456855, '29-Jul-1995': 11684209, '30-Jul-1995': 23158170, '31-Jul-1995': 30715614}
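
The docstring asks to skip entries whose byte field is '-', while the loading cell stores them as 0. For a sum the two approaches give the same daily totals. The plain-Python sketch below shows the skipping variant on made-up records; `bandwidth_per_day` and the sample values are illustrative assumptions, not the notebook's pandas implementation.

```python
from collections import defaultdict
from datetime import datetime


def bandwidth_per_day(records, year, month):
    """Sum bytes per day for one month, skipping entries whose size is '-'."""
    totals = defaultdict(int)
    for timestamp, raw_bytes in records:
        if raw_bytes == '-':
            continue  # no byte count was logged for this request
        if timestamp.year == year and timestamp.month == month:
            totals[timestamp.strftime('%d-%b-%Y')] += int(raw_bytes)
    return dict(totals)


# Made-up (timestamp, bytes) pairs for illustration
records = [
    (datetime(1995, 6, 30, 23, 59), '100'),
    (datetime(1995, 7, 1, 0, 10), '200'),
    (datetime(1995, 7, 1, 8, 0), '-'),
    (datetime(1995, 7, 2, 9, 30), '400'),
]

print(bandwidth_per_day(records, 1995, 7))
# {'01-Jul-1995': 200, '02-Jul-1995': 400}
```

One difference to keep in mind: a day whose entries all have '-' is absent from this result, whereas storing 0 keeps that day with a total of 0.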


### Q8: Hourly request distribution

In [10]:
def hourly_request_distribution(dataframe) -> dict[int, int]:
    """
    Q8: Hourly request distribution.

    Objective:
        Count the number of requests made during each hour (00 to 23).
        Useful for understanding traffic peaks.

    Returns:
        dict: A dictionary mapping hour (int) to request count.
              Example: {0: 120, 1: 90, ..., 23: 80}
    """

    # Count the requests per hour and order the result by hour (0 to 23)
    hour_count = dataframe['hour'].value_counts().sort_index().to_dict()

    return hour_count


answer8 = hourly_request_distribution(df)
print("Answer 8:")
print(answer8)

Answer 8:
{0: 11510, 1: 9832, 2: 9346, 3: 8101, 4: 7789, 5: 8234, 6: 9750, 7: 11896, 8: 17302, 9: 21637, 10: 25627, 11: 28584, 12: 26749, 13: 29997, 14: 29636, 15: 28041, 16: 28202, 17: 23229, 18: 17778, 19: 17253, 20: 17437, 21: 15915, 22: 14500, 23: 13534}
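
Since the docstring mentions traffic peaks, here is a short sketch that reads the busiest and quietest hours off the distribution. The dictionary is copied from Answer 8; `busiest_hours` is a hypothetical helper name.

```python
def busiest_hours(distribution, n=3):
    """Return the n (hour, count) pairs with the highest request counts."""
    return sorted(distribution.items(), key=lambda item: item[1], reverse=True)[:n]


# Values copied from the Answer 8 output
hourly = {0: 11510, 1: 9832, 2: 9346, 3: 8101, 4: 7789, 5: 8234, 6: 9750,
          7: 11896, 8: 17302, 9: 21637, 10: 25627, 11: 28584, 12: 26749,
          13: 29997, 14: 29636, 15: 28041, 16: 28202, 17: 23229, 18: 17778,
          19: 17253, 20: 17437, 21: 15915, 22: 14500, 23: 13534}

print(busiest_hours(hourly))
# [(13, 29997), (14, 29636), (11, 28584)]

quietest = min(hourly, key=hourly.get)
print(quietest, hourly[quietest])
# 4 7789
```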


### Q9: Top 10 most requested filenames

In [11]:
def top_10_most_requested_filenames(dataframe) -> list[tuple[str, int]]:
    """
    Q9: Top 10 most requested filenames.

    Objective:
        Identify the most commonly requested URLs (irrespective of status code).

    Returns:
        list: A list of tuples (filename, count), sorted by count in descending order.
                Example: [('index.html', 500), ...]
    """

    # Count every requested resource regardless of status; value_counts sorts by frequency, descending
    filenames = dataframe['resource'].value_counts().head(10)
    top_10 = list(filenames.items())

    return top_10


answer9 = top_10_most_requested_filenames(df)
print("Answer 9:")
print(answer9)

Answer 9:
[('index.html', 75299), ('3.gif', 11949), ('2.gif', 11559), ('4097.gif', 4733), ('8870.jpg', 4492), ('244.gif', 4339), ('6733.gif', 4278), ('8472.gif', 3843), ('8308.gif', 3478), ('4.gif', 3357)]


### Q10: HTTP response code distribution

In [12]:
def response_code_distribution(dataframe) -> dict[int, int]:
    """
    Q10: HTTP response code distribution.

    Objective:
        Count how often each HTTP status code appears in the logs.

    Returns:
        dict: A dictionary mapping HTTP status codes (as int) to their frequency.
              Example: {200: 150000, 404: 3000}
    """

    # Count the occurrences of each status code; the result is ordered by frequency, descending
    code_response = dataframe['status'].value_counts().to_dict()

    return code_response


answer10 = response_code_distribution(df)
print("Answer 10:")
print(answer10)

Answer 10:
{200: 328438, 304: 70131, 302: 16595, 404: 14586, 403: 2022, 401: 46, 500: 28, 501: 26, 400: 7}
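
To read the distribution at a glance, the codes can be grouped by class (2xx success, 3xx redirection, 4xx client error, 5xx server error). This is a minimal sketch using the values from Answer 10; `status_class_totals` is a hypothetical helper name.

```python
from collections import Counter


def status_class_totals(distribution):
    """Aggregate per-code counts into classes such as '2xx' and '4xx'."""
    totals = Counter()
    for code, count in distribution.items():
        totals[f"{code // 100}xx"] += count
    return dict(totals)


# Values copied from the Answer 10 output
distribution = {200: 328438, 304: 70131, 302: 16595, 404: 14586, 403: 2022,
                401: 46, 500: 28, 501: 26, 400: 7}

print(status_class_totals(distribution))
# {'2xx': 328438, '3xx': 86726, '4xx': 16661, '5xx': 54}
```

The class totals add up to 431879, which matches the record count in Answer 1.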
