# Exploratory Data Analysis

In this exercise, we will explore a dataset containing samples collected from various SD-WAN network devices. The dataset includes variables such as CPU utilization, memory consumption, and software version, which can be analyzed to identify common patterns and behaviors across the devices.


### Objectives

- Load a dataset in .csv format.
- Explore the capabilities of the NumPy, Pandas, Matplotlib, and Seaborn libraries for data analysis and visualization.


Before exploring the data, we need to import the libraries that provide the tools and functions we'll be using for data manipulation, visualization, and statistical analysis. Here's a breakdown of the libraries we'll be importing and their roles in our analysis:

- **NumPy**: The fundamental package for scientific computing in Python. It provides support for large, multi-dimensional arrays and matrices, along with a collection of mathematical functions to operate on these arrays. We'll leverage NumPy for numerical computations and data transformation.
- **Pandas**: This library is crucial for data manipulation and analysis. We'll use it to load our dataset, clean and preprocess it, and perform various operations on the data.
- **Matplotlib**: This library provides a wide range of tools for creating static, interactive, and animated visualizations in Python. We'll utilize Matplotlib to generate plots and charts that help us understand the patterns and relationships within the data.
- **Seaborn**: Built on top of Matplotlib, Seaborn provides a higher-level interface for creating visually appealing and informative statistical graphics. We'll use Seaborn to generate more advanced visualizations and explore the data's statistical properties.

In [None]:
%pip install seaborn

In [None]:
%pip install openpyxl

In [None]:
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns

Next, we'll check the versions of the libraries we're using. Libraries are frequently updated, and knowing their versions is helpful for referencing the correct documentation, ensuring the analysis can be reproduced in other environments, and facilitating troubleshooting.

In [None]:
print(f"Numpy Version: {np.__version__}")
print(f"Pandas Version: {pd.__version__}")
print(f"Matplotlib Version: {matplotlib.__version__}")
print(f"Seaborn Version: {sns.__version__}")

Let's load and examine the data dictionary, provided by **domain experts**, to understand the variables included in the analysis.


In [None]:
data_dictionary = pd.read_excel('NEAI_EDA_Dictionary.xlsx')
data_dictionary

Next, we import the dataset from a CSV file and perform an initial exploratory analysis to understand its structure, its underlying relationships, and potential data-quality issues. We use Pandas for data loading and manipulation, and Matplotlib and Seaborn for basic visualizations.

In [None]:
csv_path = 'Dataset Meraki.csv'
data = pd.read_csv(csv_path, sep=';', na_values=['?'], encoding='utf-8')
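
The `sep` and `na_values` arguments in the call above control how the file is parsed: the first sets the field delimiter, the second lists strings that should be read as missing values. The following is a minimal sketch of their effect on an in-memory sample. The rows and the `sample_load` name are invented for illustration and are not taken from the real dataset.

```python
import io

import pandas as pd

# Hypothetical semicolon-separated sample; the values are invented.
sample_csv = io.StringIO(
    "timestamp;version;cpu_utilization;mem_utilization\n"
    "2024-01-01 00:00;v1;12.5;40.1\n"
    "2024-01-01 00:05;v1;?;41.0\n"
    "2024-01-01 00:10;v2;30.0;?\n"
)

# sep=';' splits fields on semicolons; na_values=['?'] turns '?' into NaN,
# which lets the utilization columns be parsed as floats instead of strings.
sample_load = pd.read_csv(sample_csv, sep=";", na_values=["?"])

print(sample_load.dtypes)
print(sample_load.isnull().sum())
```

Without `na_values=['?']`, a column containing `?` would most likely be read as text, and numeric summaries such as `describe()` would skip it.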

In [None]:
data.shape

In [None]:
data.head()

In [None]:
data.describe()

In [None]:
print("Missing Values")
data.isnull().sum()

In [None]:
print("Duplicated Rows")
data.duplicated().sum()

In [None]:
# Find columns that hold a single unique value (constant columns).
# NaN is counted as a value, which matches the behavior of a duplicated()-based check.
single_value_cols = [
    col for col in data.columns
    if data.shape[0] > 1 and data[col].nunique(dropna=False) == 1
]

print(f"Columns with a single unique value: {single_value_cols}")

In [None]:
unique_cols = [col for col in data.columns if data[col].nunique() == data.shape[0]]
print(f"Columns where every value is unique: {unique_cols}")

In [None]:
data.iloc[50]

In [None]:
data.iloc[1:10]

In [None]:
datac = data.copy()

# Drop the 'timestamp' column as it's not relevant for correlation calculation
if 'timestamp' in datac.columns:
    datac = datac.drop('timestamp', axis=1)
    print("Dropped 'timestamp' column.")

# Select only numeric columns for correlation calculation
# This will automatically exclude any remaining non-numeric columns
numeric_datac = datac.select_dtypes(include=np.number)

# Check if there are any numeric columns left before calculating correlation
if not numeric_datac.empty:
    correlation_matrix = numeric_datac.corr()
    plt.figure(figsize=(12, 10))
    sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f")
    plt.title("Correlation Matrix (Numeric Columns)")
    plt.show()
else:
    print("No numeric columns found to calculate correlation after dropping 'timestamp'.")
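
A heatmap can be hard to read when there are many columns. As a complement, the sketch below ranks column pairs by the absolute value of their correlation. It runs on synthetic data: the `sample_numeric` frame and its `noise` column are hypothetical stand-ins for `numeric_datac`.

```python
import numpy as np
import pandas as pd

# Hypothetical numeric sample; in the notebook this role is played by numeric_datac.
rng = np.random.default_rng(0)
cpu = rng.uniform(0, 100, size=200)
sample_numeric = pd.DataFrame({
    "cpu_utilization": cpu,
    "mem_utilization": cpu * 0.5 + rng.normal(0, 5, size=200),
    "noise": rng.normal(0, 1, size=200),
})

corr = sample_numeric.corr()

# Keep only the upper triangle (k=1 excludes the diagonal) so each pair appears once.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = (
    corr.where(mask)   # entries outside the upper triangle become NaN
    .stack()           # reshape to a Series indexed by (row, column)
    .dropna()          # drop the masked entries explicitly
    .sort_values(key=np.abs, ascending=False)
)

print(pairs)
```

Sorting by absolute value keeps strong negative correlations near the top alongside strong positive ones.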

In [None]:
%matplotlib inline

In [None]:
plt.figure(figsize=(16,6))

plot_data = data['cpu_utilization'].value_counts()
x = plot_data.index
y = plot_data.values

plt.bar(x,y)
plt.title('CPU Utilization')
plt.xticks(rotation=90)

plt.show()

In [None]:
plt.figure(figsize=(16,6))

plot_data = data['mem_utilization'].value_counts()
x = plot_data.index
y = plot_data.values

plt.bar(x,y)
plt.title('Memory Utilization')
plt.xticks(rotation=90)

plt.show()

In [None]:
## Analyzing CPU Utilization by Software Version

# Create a box plot to show the distribution of CPU utilization for each software version.
# Box plots are useful for visualizing the distribution, median, and potential outliers
# of a numerical variable across different categories.
plt.figure(figsize=(15, 7))
sns.boxplot(x='version', y='cpu_utilization', data=data)
plt.title('CPU Utilization by Software Version')
plt.xlabel('Software Version')
plt.ylabel('CPU Utilization (%)')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()
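
The box plot above shows quartiles and marks points beyond the whiskers as outliers. As a rough numeric counterpart, the sketch below computes the quartiles per version and counts values outside 1.5 × IQR, the default whisker length in Seaborn's `boxplot`. The `sample_versions` data and the `iqr_summary` helper are hypothetical and serve only as an illustration.

```python
import pandas as pd

# Hypothetical sample; the values are invented for illustration.
sample_versions = pd.DataFrame({
    "version": ["v1"] * 6 + ["v2"] * 6,
    "cpu_utilization": [10, 12, 11, 13, 12, 95, 40, 42, 41, 43, 44, 42],
})


def iqr_summary(values: pd.Series) -> pd.Series:
    """Return quartiles and the number of points outside 1.5 * IQR."""
    q1, median, q3 = values.quantile([0.25, 0.5, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = ((values < lower) | (values > upper)).sum()
    return pd.Series({"q1": q1, "median": median, "q3": q3, "outliers": outliers})


# One row per version, one column per statistic.
summary = (
    sample_versions.groupby("version")["cpu_utilization"]
    .apply(iqr_summary)
    .unstack()
)

print(summary)
```

On the real dataset, the same pattern could be applied by replacing `sample_versions` with `data`, assuming the `version` and `cpu_utilization` columns load as expected.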

In [None]:
## Analyzing Memory Utilization by Software Version

# Create a box plot to show the distribution of memory utilization for each software version.
# Box plots are useful for visualizing the distribution, median, and potential outliers
# of a numerical variable across different categories.
plt.figure(figsize=(15, 7))
sns.boxplot(x='version', y='mem_utilization', data=data)
plt.title('Memory Utilization by Software Version')
plt.xlabel('Software Version')
plt.ylabel('Memory Utilization (%)')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

In [None]:
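plt.figure(figsize=(6,6))

x = data['mem_utilization']
y = data['cpu_utilization']
marker = '*'  # format string: draw star markers only, with no connecting line

plt.plot(x, y, marker, color='blue')
plt.title('Memory vs. CPU Utilization')
plt.xlabel('Memory Utilization (%)')
plt.ylabel('CPU Utilization (%)')
plt.show()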

End of exploratory data analysis