<p style="text-align:center">
    <a href="https://skills.network" target="_blank">
    <img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/assets/logos/SN_web_lightmode.png" width="200" alt="Skills Network Logo"  />
    </a>
</p>


# **Lab: Exploring the Dataset**


Estimated time needed: **30** minutes


## Introduction


Data exploration is the initial phase of data analysis where we aim to understand the data's characteristics, identify patterns, and uncover potential insights. It is a crucial step that helps us make informed decisions about subsequent analysis.


## Objectives


After completing this lab, you will be able to:


-   Summarize the key characteristics of a dataset.
-   Identify different data types commonly used in data analysis.


### Install the required library


In [1]:
import micropip

await micropip.install('pandas')

# Import pandas after installation
import pandas as pd
print(pd.__version__)



2.2.0


## Load the dataset


<h3>Read Data</h3>
<p>
The <code>pandas.read_csv()</code> function reads CSV files into a DataFrame. Because this version of the lab runs on JupyterLite, the dataset must first be downloaded into the browser environment using the code below.
</p>


The function below downloads the dataset into your browser:


In [2]:
from pyodide.http import pyfetch

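async def download(url, filename):
    # Fetch the file and write it to the in-browser file system
    response = await pyfetch(url)
    if response.status != 200:
        # Fail loudly instead of silently skipping the write
        raise RuntimeError(f"Download failed with HTTP status {response.status}")
    with open(filename, "wb") as f:
        f.write(await response.bytes())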

In [3]:
file_path = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/n01PQ9pSmiRX6520flujwQ/survey-data.csv"

To obtain the dataset, call the `download()` function defined above:


In [4]:
file_name = "survey_data.csv"
await download(file_path, file_name)

Use the pandas function `read_csv()` to load the data into a DataFrame.


In [5]:
df = pd.read_csv(file_name)

> Note: This version of the lab runs on JupyterLite, which requires the dataset to be downloaded to the interface. If you are working with a downloaded copy of this notebook on your local machine (Jupyter via Anaconda), you can **skip the steps above** and pass the URL directly to `pandas.read_csv()`, for example `df = pd.read_csv(file_path)`.


# Hands-on Lab


## Explore the dataset


Printing the first 5 rows is a quick way to get a feel for how the dataset looks.


Display the first 5 rows and the first 5 columns of your dataset.


In [8]:
## Write your code here
print(df.iloc[:5, :5])


   ResponseId                      MainBranch                 Age  \
0           1  I am a developer by profession  Under 18 years old   
1           2  I am a developer by profession     35-44 years old   
2           3  I am a developer by profession     45-54 years old   
3           4           I am learning to code     18-24 years old   
4           5  I am a developer by profession     18-24 years old   

            Employment RemoteWork  
0  Employed, full-time     Remote  
1  Employed, full-time     Remote  
2  Employed, full-time     Remote  
3   Student, full-time        NaN  
4   Student, full-time        NaN  
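
The cell above slices by position with `iloc`. As a minimal sketch of how that differs from `head()`, the example below uses a small made-up DataFrame (the `demo` frame and its column names are hypothetical and stand in for the survey data):

```python
import pandas as pd

# Hypothetical toy data; not the real survey dataset.
demo = pd.DataFrame({
    "a": range(10),
    "b": range(10, 20),
    "c": list("abcdefghij"),
})

# head(n) returns the first n rows and keeps every column.
first_rows = demo.head(5)

# iloc[:n, :m] slices by position: the first n rows and the first m columns.
first_block = demo.iloc[:5, :2]

print(first_rows.shape)
print(first_block.shape)
```

With 114 columns in the survey data, limiting the column slice keeps the printed preview readable.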


## Find out the number of rows and columns


Start by finding the number of rows and columns in the dataset.


Print the number of rows in the dataset.


In [9]:
## Write your code here
# Print the number of rows in the dataset
print("Number of rows:", df.shape[0])


Number of rows: 65437


Print the number of columns in the dataset.


In [10]:
## Write your code here
# Using the .shape attribute
print("Number of columns:", df.shape[1])

# Alternatively, using len() on df.columns
print("Number of columns:", len(df.columns))



Number of columns: 114
Number of columns: 114
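
Since `shape` is a `(rows, columns)` tuple, both counts can also be read in one step. A minimal sketch on a hypothetical toy DataFrame (`demo` is an illustrative name, not part of the lab):

```python
import pandas as pd

# Hypothetical toy data used only to illustrate the calls.
demo = pd.DataFrame({"x": [1, 2, 3], "y": ["a", "b", "c"]})

# Unpack the (rows, columns) tuple returned by shape.
n_rows, n_cols = demo.shape
print(f"{n_rows} rows x {n_cols} columns")

# len() on the DataFrame counts rows; len() on .columns counts columns.
print(len(demo), len(demo.columns))
```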


## Identify the data types of each column


Explore the dataset and identify the data type of each column.


Print the data type of every column.


In [11]:
## Write your code here
# Print data types for all columns in the DataFrame
print(df.dtypes)


ResponseId               int64
MainBranch              object
Age                     object
Employment              object
RemoteWork              object
                        ...   
JobSatPoints_11        float64
SurveyLength            object
SurveyEase              object
ConvertedCompYearly    float64
JobSat                 float64
Length: 114, dtype: object
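
With more than a hundred columns, the full `dtypes` listing is truncated when printed. One possible way to summarize it, sketched on a hypothetical toy DataFrame (`demo` and its columns are made up for illustration), is to count columns per dtype and to select column names by kind:

```python
import pandas as pd

# Hypothetical toy data with one integer, one float, and one text column.
demo = pd.DataFrame({
    "id": [1, 2, 3],
    "score": [0.5, 1.5, None],
    "label": ["a", "b", None],
})

# Count how many columns share each dtype.
dtype_counts = demo.dtypes.value_counts()
print(dtype_counts)

# Split the column names into numeric and non-numeric groups.
numeric_cols = demo.select_dtypes(include="number").columns.tolist()
other_cols = demo.select_dtypes(exclude="number").columns.tolist()
print(numeric_cols)
print(other_cols)
```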


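Print the mean age of the survey participants. Because `Age` is stored as text brackets (dtype `object`), the mean can only be estimated by mapping each bracket to a representative number.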


In [13]:
import pandas as pd

# Read in the CSV file
file_name = "survey_data.csv"
df = pd.read_csv(file_name)

# Display the first few rows to inspect the Age column
print(df.head())

# Map each age bracket to an estimated numeric value
# (roughly the bracket midpoint; the open-ended brackets use assumed values)
age_map = {
    "Under 18 years old": 17,
    "18-24 years old": 21,
    "25-34 years old": 29.5,
    "35-44 years old": 39.5,
    "45-54 years old": 49.5,
    "55-64 years old": 59.5,
    "65 years or older": 65
}

# Check if the 'Age' column exists and map the values
if 'Age' in df.columns:
    # Create a new column with numeric age estimates
    df['Age_numeric'] = df['Age'].map(age_map)
    
    # Optional: Report labels with no entry in age_map (genuinely missing ages are not counted)
    unmapped = df.loc[df['Age'].notna() & df['Age_numeric'].isna(), 'Age'].unique()
    if len(unmapped) > 0:
        print("Warning: unmapped age values:", unmapped)
    
    # Calculate the mean age using the numeric column
    mean_age = df['Age_numeric'].mean()
    print("Mean age of survey participants (estimated):", mean_age)
else:
    print("The 'Age' column was not found in the dataset.")


   ResponseId                      MainBranch                 Age  \
0           1  I am a developer by profession  Under 18 years old   
1           2  I am a developer by profession     35-44 years old   
2           3  I am a developer by profession     45-54 years old   
3           4           I am learning to code     18-24 years old   
4           5  I am a developer by profession     18-24 years old   

            Employment RemoteWork   Check  \
0  Employed, full-time     Remote  Apples   
1  Employed, full-time     Remote  Apples   
2  Employed, full-time     Remote  Apples   
3   Student, full-time        NaN  Apples   
4   Student, full-time        NaN  Apples   

                                    CodingActivities  \
0                                              Hobby   
1  Hobby;Contribute to open-source projects;Other...   
2  Hobby;Contribute to open-source projects;Other...   
3                                                NaN   
4                                 
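
The bracket-to-number mapping above relies on two pandas behaviors: `Series.map` with a dictionary yields `NaN` for any label missing from the dictionary, and `mean()` skips `NaN` by default. A minimal sketch on a hypothetical toy Series (the `"Prefer not to say"` label and the two-entry `age_map` are assumptions made for illustration):

```python
import pandas as pd

# Hypothetical responses, including one label absent from the map and one missing value.
ages = pd.Series([
    "18-24 years old",
    "25-34 years old",
    "25-34 years old",
    "Prefer not to say",
    None,
])

# Assumed representative values for two brackets only.
age_map = {"18-24 years old": 21, "25-34 years old": 29.5}

# Labels not found in the dictionary become NaN.
numeric = ages.map(age_map)

# mean() ignores NaN, so only the three mapped rows contribute.
mean_age = numeric.mean()

# Labels that were present in the data but had no mapping.
unmapped = ages[ages.notna() & numeric.isna()].unique().tolist()

print(mean_age)
print(unmapped)
```

Any respondent whose bracket is unmapped is silently left out of the average, so it is worth listing the unmapped labels before trusting the estimate.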

The dataset is the result of a worldwide survey. Print how many unique countries there are in the `Country` column.


In [14]:
## Write your code here
# Calculate the number of unique countries in the 'Country' column
num_unique_countries = df['Country'].nunique()

# Print the result
print("Number of unique countries in the dataset:", num_unique_countries)


Number of unique countries in the dataset: 185
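
By default `nunique()` does not count missing values. A minimal sketch on a hypothetical toy Series (the country values are made up for illustration) showing that default, the `dropna=False` variant, and `value_counts()` for the per-country totals:

```python
import pandas as pd

# Hypothetical responses with one repeated country and one missing value.
countries = pd.Series(["India", "Germany", "India", None, "Brazil"])

# Missing values are excluded by default.
n_unique = countries.nunique()

# dropna=False counts the missing value as its own category.
n_with_missing = countries.nunique(dropna=False)

# value_counts() lists each country with its frequency, most frequent first.
counts = countries.value_counts()

print(n_unique, n_with_missing)
print(counts)
```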


Copyright ©  IBM Corporation. All rights reserved.
