THEORY QUESTIONS
1. What is NumPy, and why is it widely used in Python?

-  NumPy (Numerical Python) is a powerful open-source Python library used for numerical and scientific computing.

 It is widely used for the following reasons:

 i. Efficient array handling:
Provides the ndarray object for fast multi-dimensional array operations.
Much faster than Python lists for large numerical data.

 ii. Mathematical operations:
Supports element-wise operations (addition, multiplication, etc.) on arrays.
Includes functions for linear algebra, statistics, Fourier transforms, etc.

 iii. Broadcasting:
Allows operations between arrays of different shapes without explicit looping.

 iv. Memory efficiency:
Arrays take less memory than regular Python data structures holding the same numbers.

 v. Integration:
Works well with other libraries like Pandas, Matplotlib, Scikit-learn, etc.

 vi. Foundation of data science & ML in Python:
Used as a base for many other popular libraries in data science and machine learning.

2. How does broadcasting work in NumPy?

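-  Broadcasting is the set of rules NumPy uses to perform element-wise operations on arrays of different shapes without explicitly copying data to make the shapes match. Shapes are compared dimension by dimension, starting from the trailing (rightmost) one. Two dimensions are compatible when they are equal or when one of them is 1, and a missing leading dimension is treated as 1. A dimension of size 1 is then virtually stretched to match the other array, so adding a (3,) row to a (2, 3) matrix adds that row to each of the two rows. If any pair of dimensions is incompatible, NumPy raises a ValueError.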
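
  As a small illustration of these rules, the sketch below adds a row and a column to a 2x3 matrix and then tries an incompatible shape. The array names and values are made up for the example.

```python
import numpy as np

matrix = np.array([[1, 2, 3],
                   [4, 5, 6]])      # shape (2, 3)
row = np.array([10, 20, 30])        # shape (3,), treated as (1, 3)
column = np.array([[100], [200]])   # shape (2, 1)

# The row is virtually repeated for each of the two rows of the matrix.
plus_row = matrix + row
# The column is virtually repeated across the three columns.
plus_column = matrix + column

print(plus_row)
print(plus_column)

# Trailing dimensions 3 and 2 are neither equal nor 1, so this fails.
try:
    matrix + np.array([1, 2])
    incompatible = False
except ValueError:
    incompatible = True
print("Incompatible shapes raise ValueError:", incompatible)
```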

3. What is a Pandas DataFrame?

-  A Pandas DataFrame is a two-dimensional, size-mutable, and heterogeneous data structure used in Python for data manipulation and analysis. It is similar to a table in a database or an Excel spreadsheet, where data is arranged in rows and columns. Each column in a DataFrame can hold data of a different type, such as integers, floats, or strings, and both the rows and columns are labeled, making it easy to access and manipulate the data. DataFrames are particularly powerful because they allow for efficient handling of large datasets, including filtering, grouping, merging, reshaping, and handling missing values. They also support input and output operations with various file formats like CSV, Excel, SQL databases, and JSON, which makes them highly versatile for real-world data tasks.
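
  A minimal sketch of these properties, using hypothetical sample data: columns of different types, labeled rows, label-based access, filtering, and adding a column after creation.

```python
import pandas as pd

# Hypothetical sample data: each column holds a different type.
people = pd.DataFrame(
    {
        "name": ["Ana", "Ben", "Cy"],
        "age": [31, 25, 40],
        "score": [88.5, 92.0, 79.5],
    },
    index=["a", "b", "c"],  # row labels
)

print(people.dtypes)                      # one dtype per column

bens_age = people.loc["b", "age"]         # access by row and column label
over_30 = people[people["age"] > 30]      # boolean filtering on rows

# Size-mutable: a new column can be added after creation.
people["passed"] = people["score"] >= 80
print(people)
```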

4. Explain the use of the groupby() method in Pandas.

-  The groupby() method in Pandas is used to group data based on one or more columns. It allows you to split the data into groups, apply some function to each group independently, and then combine the results back into a single DataFrame or Series. This is particularly useful for performing operations like aggregation (e.g., sum, mean, count), transformation, or filtering on subsets of data.

  For example, if you have a DataFrame of sales records with a column for region, you can use groupby('region') to group the data by each region and then calculate total sales for each. This method is powerful for summarizing and analyzing structured data, and it’s widely used in data preprocessing and exploration tasks.
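
  The sales example above could look like the following sketch. The column names and figures are invented for illustration.

```python
import pandas as pd

# Hypothetical sales records
sales = pd.DataFrame({
    "region": ["North", "South", "North", "East", "South"],
    "amount": [100, 150, 200, 50, 250],
})

# Split by region, apply sum to each group, combine into one Series.
total_by_region = sales.groupby("region")["amount"].sum()
print(total_by_region)

# Several aggregations at once give a small summary table.
summary = sales.groupby("region")["amount"].agg(["sum", "mean", "count"])
print(summary)
```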

5. Why is Seaborn preferred for statistical visualizations?

-  Seaborn is preferred for statistical visualizations in Python because it provides a high-level interface for drawing attractive and informative statistical graphics. Built on top of Matplotlib, Seaborn simplifies complex visualization tasks by offering built-in themes, color palettes, and functions that make it easier to explore and understand data. It integrates well with Pandas DataFrames, allowing for seamless plotting of categorical and continuous variables. Seaborn also includes several specialized plots for visualizing statistical relationships, such as violin plots, box plots, pair plots, and heatmaps, making it an excellent tool for data analysis and exploration. Its ability to handle missing data, perform automatic aggregation, and display regression lines further enhances its appeal for data scientists and analysts.


6. What are the differences between NumPy arrays and Python lists?

- NumPy arrays and Python lists are both used to store collections of data, but they differ significantly in their characteristics and use cases:

  Data Type Homogeneity:

  NumPy arrays are designed to store elements of a single, uniform data type (e.g., all integers, all floats). Python lists, conversely, can store elements of various data types within the same list (heterogeneous).

  Performance and Efficiency:

  NumPy arrays offer significantly better performance for numerical computations, especially with large datasets, due to their underlying C implementation and optimized operations. Python lists, while flexible, are generally slower for numerical tasks.

  Memory Usage:

  NumPy arrays are more memory-efficient than Python lists because they store data in contiguous blocks of memory and do not need to store type information for each individual element, unlike Python lists.

  Functionality and Operations:

  NumPy arrays provide a rich set of mathematical functions and vectorized operations (e.g., element-wise addition, multiplication) that are highly optimized for numerical tasks. Python lists offer more general-purpose methods for list manipulation (e.g., append(), insert(), remove()).

  Flexibility and Mutability:

  Python lists are highly flexible and dynamic in size; elements can be easily added, removed, or modified. NumPy arrays are generally fixed in size once created, though their elements can be modified. Resizing a NumPy array typically involves creating a new array.

  Use Cases:

  Python lists are suitable for general-purpose data storage and manipulation where flexibility and heterogeneous data are required. NumPy arrays are specifically designed for scientific computing, data analysis, and machine learning tasks that involve large numerical datasets and require high performance.  
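
  A few of these differences can be seen directly in a short sketch (the values are arbitrary):

```python
import numpy as np

py_list = [1, 2, 3]
np_arr = np.array([1, 2, 3])

# The same operator means different things:
doubled_list = py_list * 2   # repeats the list
doubled_arr = np_arr * 2     # multiplies every element

# Homogeneity: mixing ints and a float upcasts the whole array to float.
mixed = np.array([1, 2.5, 3])

# Lists grow in place; np.append returns a new, larger array instead.
py_list.append(4)
appended = np.append(np_arr, 4)

print(doubled_list, doubled_arr, mixed.dtype, appended, np_arr)
```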

7. What is a heatmap, and when should it be used?

-  A heatmap is a data visualization technique that uses color-coded representations to display the magnitude of values within a matrix or grid. Essentially, it's a way to represent data using colors, making it easier to quickly grasp patterns, trends, and outliers. Darker or warmer colors usually represent higher values, while lighter or cooler colors represent lower values.

  Heatmaps are versatile and can be used in various contexts, including:

  Website and app user behavior analysis:
  To understand where users click, scroll, or hover on a page, revealing areas of high engagement or neglect.

  Data analysis and visualization:
  To identify correlations, clusters, and outliers in datasets, particularly when dealing with large amounts of data.

  Business intelligence:
  To track performance metrics across different locations, products, or time periods, enabling data-driven decision-making.

  Scientific research:
  To visualize gene expression, protein interactions, or other biological data, revealing patterns and relationships.

  Financial analysis:
  To visualize correlations between stock prices or other financial indicators.

  Geographic data:
  To represent density, temperature, or other spatial data, such as population density or temperature variations.
    

8. What does the term “vectorized operation” mean in NumPy?

-   The term “vectorized operation” refers to performing operations on entire arrays (vectors, matrices) without using explicit loops like for or while. Instead of processing each element individually in Python, NumPy uses optimized C-based code under the hood to perform these operations on the whole array at once.

  This makes vectorized operations much faster and more efficient than looping through array elements manually. For example, if you want to add 5 to every element in an array, you can simply write array + 5 instead of looping through each element and adding 5.
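
  The "add 5 to every element" example can be written both ways for comparison. This is a sketch with a tiny array; the speed difference only becomes noticeable on large arrays.

```python
import numpy as np

values = np.arange(5)            # [0, 1, 2, 3, 4]

# Explicit Python loop: one element at a time.
looped = np.empty_like(values)
for i in range(values.size):
    looped[i] = values[i] + 5

# Vectorized: the loop runs inside NumPy's compiled code.
vectorized = values + 5

print(looped, vectorized)
```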

9. How does Matplotlib differ from Plotly?

-  Matplotlib and Plotly are both powerful Python libraries for data visualization, but they differ significantly in their capabilities and usage.

  Matplotlib is a long-established plotting library used mainly for static 2D plots, although it also offers basic 3D plotting. It is highly customizable and widely used in scientific and academic contexts. Its output is typically a static image; interactive backends add zooming and panning, but features such as hover tooltips are not available by default. Matplotlib is known for its stability and for integrating well with other libraries like NumPy and Pandas.

  Plotly, on the other hand, is designed for interactive and web-based visualizations. It allows users to create dynamic charts that can be zoomed, panned, or hovered over to reveal additional information. Plotly supports both 2D and 3D graphics and can generate dashboards and charts that can be embedded in web applications. It’s especially popular in modern data analysis and business intelligence for its interactivity and aesthetic appeal.

10. What is the significance of hierarchical indexing in Pandas?

-  Hierarchical indexing (also called MultiIndexing) in Pandas allows you to have multiple levels of indexes on a DataFrame or Series. This provides a powerful way to work with higher-dimensional data in a 2D format: you can select or aggregate by any level, and reshape between long and wide layouts with stack() and unstack().
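
  To make hierarchical indexing concrete, here is a minimal sketch of a Series with a two-level index (region, year). The labels and numbers are invented for illustration.

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [("North", 2023), ("North", 2024), ("South", 2023), ("South", 2024)],
    names=["region", "year"],
)
sales = pd.Series([100, 120, 80, 95], index=index, name="sales")

north = sales.loc["North"]                 # all years for one region
south_2024 = sales.loc[("South", 2024)]    # a single (region, year) value

# unstack() moves the "year" level into columns: a region x year table.
wide = sales.unstack("year")
print(wide)
```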

11. What is the role of Seaborn’s pairplot() function?
-   The pairplot() function in Seaborn is used to create a grid of scatter plots that shows relationships between pairs of variables in a dataset, along with histograms or density plots on the diagonals. It is particularly helpful for exploratory data analysis (EDA) because it allows you to quickly visualize distributions and correlations between multiple features in a DataFrame.

  Each off-diagonal cell in the grid is a scatter plot of two variables. If you pass a categorical column through the hue parameter, Seaborn color-codes the points by category, helping you see how different categories relate to each other across the dataset.

12. What is the purpose of the describe() function in Pandas?

-  The describe() function in Pandas generates descriptive statistics that summarize the central tendency, dispersion, and shape of a dataset's distribution. For numeric columns it reports the count, mean, standard deviation, minimum, quartiles (25%, 50%, 75%), and maximum, giving a quick overview of the data in a DataFrame or Series.
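
  A quick sketch of describe() on a small, made-up numeric DataFrame:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 4, 7], "B": [2, 5, 8], "C": [3, 6, 9]})

summary = df.describe()   # one column of statistics per numeric column
print(summary)

# Individual statistics can be read by row label and column name.
print("Mean of B:", summary.loc["mean", "B"])
```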

13. Why is handling missing data important in Pandas?

-  Handling missing data in Pandas is crucial because incomplete or null values can lead to incorrect analysis, misleading visualizations, or even cause errors in your code. When working with real-world datasets, it’s common to encounter missing values due to various reasons such as human error, data corruption, or different data collection methods.

  If missing data is not addressed properly, operations like calculating averages, correlations, or training machine learning models can yield inaccurate or biased results. Pandas provides powerful tools to detect, fill, or drop missing data so that you can ensure the integrity and reliability of your analysis. Proper handling of these values helps maintain the quality of the dataset, supports robust decision-making, and enhances the overall performance of data-driven applications.
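
  The detect, fill, and drop steps mentioned above can be sketched as follows. The data is hypothetical, and filling with the column mean is only one of several possible strategies.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "temp": [20.0, np.nan, 24.0],
    "city": ["Oslo", "Rome", None],
})

# Detect: count missing values per column.
missing_counts = df.isna().sum()

# Fill: replace missing temperatures with the column mean
# (mean() skips NaN by default, so this is the mean of 20.0 and 24.0).
filled_temp = df["temp"].fillna(df["temp"].mean())

# Drop: keep only rows with no missing values at all.
complete_rows = df.dropna()

print(missing_counts)
print(filled_temp)
print(complete_rows)
```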


14. What are the benefits of using Plotly for data visualization?

-  Plotly offers several benefits for data visualization, especially when creating interactive and visually appealing charts. One of the main advantages is its ability to produce interactive graphs directly in web browsers, allowing users to zoom, hover, and explore data in a more dynamic way compared to static libraries like Matplotlib.

  Plotly supports a wide variety of chart types, including basic ones like bar, line, and scatter plots, as well as more advanced visuals like 3D plots, maps, and dashboards. It is also easy to integrate with Pandas and NumPy, which makes it convenient for data analysts and scientists to visualize data directly from their workflows.

  Another key benefit is that Plotly works well with web-based environments like Jupyter Notebooks and Google Colab, and it can be used with Dash to build interactive dashboards for real-time data visualization. Its modern design and support for responsive layouts make it a strong choice for presentations and reports where clarity and interaction matter.

15. How does NumPy handle multidimensional arrays?

-  NumPy handles multidimensional arrays primarily through its core data structure, the ndarray (N-dimensional array). This ndarray object provides a highly efficient and optimized way to store and manipulate homogeneous data in a grid-like structure with multiple dimensions or "axes."

  Internally, NumPy stores an array's elements in a single block of memory (contiguous by default), together with shape and stride metadata that describe how to step along each axis. This layout makes mathematical operations very efficient. NumPy supports advanced indexing, slicing, broadcasting, and vectorized operations across dimensions, which lets users perform complex operations on large datasets without writing explicit loops, making the code both faster and cleaner.
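
  As a small illustration, the sketch below builds a 3-dimensional array and indexes and reduces it along different axes. The shape (2, 3, 4) is an arbitrary choice.

```python
import numpy as np

cube = np.arange(24).reshape(2, 3, 4)   # 2 blocks, 3 rows, 4 columns

print(cube.ndim, cube.shape)            # number of axes and size of each

last = cube[1, 2, 3]                    # one index per axis
column = cube[0, :, 1]                  # block 0, every row, column 1

# Reductions take an axis argument; that axis disappears from the result.
sum_over_blocks = cube.sum(axis=0)      # shape (3, 4)
sum_over_columns = cube.sum(axis=2)     # shape (2, 3)

print(last, column, sum_over_blocks.shape, sum_over_columns.shape)
```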

16. What is the role of Bokeh in data visualization?

-  Bokeh is a powerful Python library designed for creating interactive and visually appealing data visualizations for the web. Unlike static visualization libraries like Matplotlib, Bokeh enables users to build dynamic plots that can respond to user inputs such as zooming, panning, hovering, or selecting data points.

  The primary role of Bokeh in data visualization is to bridge the gap between Python-based data analysis and modern web-based visual presentation. It allows data scientists and analysts to create complex, interactive dashboards and visualizations with minimal effort. These visualizations can be easily embedded into web applications or exported as standalone HTML files.

17. Explain the difference between apply() and map() in Pandas.

-  In Pandas, both apply() and map() are used to perform operations on data, but they differ in how and where they are applied:

  map()

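  i.Series.map() works element-wise on a Series (one-dimensional); recent Pandas versions also provide DataFrame.map() for element-wise operations on a whole DataFrame.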

  ii.It is typically used for element-wise transformations.

  iii.Useful when you want to map values using a function, dictionary, or Series.

  apply()

  i.Works on both Series and DataFrames.

  ii.More flexible; you can apply a function to each row or column in a DataFrame, or each element in a Series.

  iii.Useful for row-wise or column-wise operations in a DataFrame.
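
  The difference can be sketched on a small, hypothetical grades table: map() translates the values of one Series, while apply() runs a function over each column or each row of a DataFrame.

```python
import pandas as pd

df = pd.DataFrame({
    "grade": ["a", "b", "a"],
    "math": [90, 70, 85],
    "art": [80, 75, 95],
})

# map(): element-wise on a Series, here with a dictionary lookup.
points = df["grade"].map({"a": 4, "b": 3})

# apply() on a DataFrame: the function receives one column at a time...
column_max = df[["math", "art"]].apply(lambda col: col.max())

# ...or one row at a time when axis=1.
row_total = df[["math", "art"]].apply(lambda row: row["math"] + row["art"], axis=1)

print(points.tolist(), column_max.to_dict(), row_total.tolist())
```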

18. What are some advanced features of NumPy?

-   Here are some advanced features of NumPy that make it powerful for scientific computing and data analysis:

  1. Broadcasting

  Allows arithmetic operations between arrays of different shapes without explicit loops.

  Saves memory and improves performance.

  2. Vectorization

  Replaces explicit loops with array operations, making code faster and cleaner.

  3. Advanced Indexing and Slicing

  You can use boolean masks, fancy indexing, and slicing to access and manipulate data.

  4. Structured Arrays

  Used to store complex records (like tables) in a single NumPy array.

  5. Universal Functions (ufuncs)

  Fast element-wise operations like np.add(), np.sin(), np.exp().

  6. Memory Efficiency

  NumPy uses contiguous memory blocks and efficient data types, reducing memory usage.

  7. Random Module

  Tools for generating random numbers, distributions, and reproducibility (np.random).

  8. Linear Algebra Operations

  Supports matrix multiplication, inverses, determinants, etc.
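
  Several of these features fit in one short sketch. All names and values below are illustrative only.

```python
import numpy as np

data = np.array([3, -1, 7, 0, -5])

# Advanced indexing: boolean mask and fancy (integer-list) indexing.
positives = data[data > 0]
picked = data[[0, 2, 4]]

# Structured array: named fields with different types in one array.
people = np.array(
    [("Ana", 31), ("Ben", 25)],
    dtype=[("name", "U10"), ("age", "i4")],
)
ages = people["age"]

# Universal function: element-wise absolute value.
magnitudes = np.abs(data)

# Random module: the same seed reproduces the same numbers.
first = np.random.default_rng(seed=0).random(3)
second = np.random.default_rng(seed=0).random(3)

# Linear algebra: determinant and inverse of a small matrix.
m = np.array([[2.0, 0.0], [0.0, 4.0]])
det = np.linalg.det(m)
inv = np.linalg.inv(m)

print(positives, picked, ages, magnitudes, det)
```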

19. How does Pandas simplify time series analysis?

-  Pandas simplifies time series analysis by providing a comprehensive set of tools to handle and manipulate time-indexed data efficiently. One of its key features is the ability to convert date strings into datetime objects using pd.to_datetime(), which allows for proper date and time indexing. With this, users can easily perform time-based indexing and filtering — such as selecting data from a specific month or year — using intuitive syntax. Pandas also supports powerful resampling capabilities, which enable users to change the frequency of data (for example, converting daily data into monthly averages) using functions like .resample().

  In addition, Pandas makes it easy to apply rolling or expanding window functions to compute moving averages or cumulative statistics, which are essential in trend and volatility analysis. It includes built-in functions like .shift() to create lag features, often used in forecasting models. Time zones can also be localized and converted with simple methods, allowing for accurate handling of global datasets. Furthermore, Pandas integrates seamlessly with plotting libraries, enabling quick and clear visualizations of trends over time. Overall, these features make Pandas a powerful and user-friendly tool for time series analysis, suitable for beginners and experts alike.
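
  The main tools named above can be sketched on a tiny, made-up price series. The "MS" (month start) frequency is just one possible resampling choice.

```python
import pandas as pd

dates = pd.to_datetime(
    ["2024-01-01", "2024-01-02", "2024-01-03", "2024-02-01", "2024-02-02"]
)
prices = pd.Series([10.0, 12.0, 14.0, 20.0, 22.0], index=dates)

# Time-based selection: every observation in January 2024.
january = prices.loc["2024-01"]

# Resampling: daily values to one mean per month.
monthly_mean = prices.resample("MS").mean()

# Rolling window: 2-observation moving average (first value is NaN).
rolling_mean = prices.rolling(window=2).mean()

# Lag feature: the previous observation's price.
lagged = prices.shift(1)

print(january, monthly_mean, rolling_mean, lagged, sep="\n\n")
```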

20. What is the role of a pivot table in Pandas?

-    A pivot table is a powerful tool used to summarize, analyze, and reorganize data efficiently. It allows users to transform large datasets into a more understandable format by grouping data based on specific columns and applying aggregation functions such as mean, sum, count, or others to the grouped data. The pivot table helps in identifying trends, patterns, and insights by rearranging rows and columns to highlight the relationship between different variables. For example, in a sales dataset, a pivot table can show the total sales per product category across different regions or months. By using the pivot_table() function in Pandas, users can easily specify which columns to use as rows, columns, and values, and choose the type of aggregation they want. This makes pivot tables particularly useful in data analysis and reporting tasks, especially when dealing with complex and multidimensional data.
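
  The sales example described above might look like this sketch, with invented categories, regions, and amounts:

```python
import pandas as pd

sales = pd.DataFrame({
    "category": ["Toys", "Toys", "Books", "Books", "Toys"],
    "region": ["North", "South", "North", "South", "North"],
    "amount": [100, 80, 50, 70, 40],
})

# Rows: category, columns: region, cells: total amount.
table = sales.pivot_table(
    index="category",
    columns="region",
    values="amount",
    aggfunc="sum",
)
print(table)
```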

21. Why is NumPy’s array slicing faster than Python’s list slicing?

-  NumPy’s array slicing is faster than Python’s list slicing primarily because of the following reasons:

  1.Fixed-Type and Homogeneous Data

  NumPy arrays store data of the same type in contiguous blocks of memory.

  Python lists are heterogeneous (can store mixed types), requiring additional memory overhead and pointer indirection.

  2.Contiguous Memory Layout

  NumPy arrays use a compact memory layout (C-style or Fortran-style), enabling direct memory access.

  Slicing just returns a view (not a copy) of the original array, avoiding data duplication.

  3.Implemented in C

  NumPy's core is written in C, so slicing runs in compiled code that only builds a small array header.

  CPython's list slicing is also implemented in C, but it has to allocate a new list and copy a reference to every selected element, so its cost grows with the length of the slice.

  4.No Per-Element Work During Slicing

  A basic NumPy slice never touches individual elements; it records a new offset, shape, and strides into the same buffer.

  A list slice performs no type checks either, but it must copy each element reference and update its reference count, which adds per-element overhead.
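
  The view-versus-copy behavior behind this difference can be checked directly. This is a small sketch that demonstrates semantics, not timing.

```python
import numpy as np

arr = np.arange(10)
arr_slice = arr[2:6]          # a view onto the same memory
arr_slice[0] = 99             # writing through the view changes arr

lst = list(range(10))
lst_slice = lst[2:6]          # a new list holding copied references
lst_slice[0] = 99             # the original list is unaffected

shares = np.shares_memory(arr, arr_slice)
print(arr[2], lst[2], shares)
```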

22. What are some common use cases for Seaborn?

-  Seaborn is a powerful Python data visualization library built on top of Matplotlib, designed to make it easier to create attractive and informative statistical graphics. Some common use cases for Seaborn include:

  Exploratory Data Analysis (EDA): Seaborn helps visualize distributions, relationships, and patterns in datasets using plots like histograms, boxplots, and scatterplots.

  Visualizing Statistical Relationships: It is ideal for visualizing relationships between multiple variables with plots such as scatterplot(), lineplot(), lmplot() (for regression), and pairplot() (for comparing all variable pairs).

  Distribution Analysis: Seaborn provides functions like distplot() (deprecated), histplot(), and kdeplot() to study data distributions, including histograms and kernel density estimates.

  Categorical Data Visualization: With functions like boxplot(), violinplot(), stripplot(), and swarmplot(), Seaborn is excellent for summarizing categorical data with respect to numerical data.

  Heatmaps and Correlation Analysis: Seaborn’s heatmap() is widely used to visualize correlation matrices or frequency tables in a color-coded format.

  Time Series Analysis: Using lineplot(), Seaborn makes it easy to plot and compare time-based data.

  Faceting: Seaborn supports subplots using functions like FacetGrid and catplot() to plot data subsets across multiple panels for comparison.


PRACTICAL QUESTIONS

In [None]:
# 1. How do you create a 2D NumPy array and calculate the sum of each row?
import numpy as np
arr = np.array([[1,2,3],[4,5,6]])
arr_sum = np.sum(arr, axis = 1)
print(arr_sum)

In [None]:
# 2. Write a Pandas script to find the mean of a specific column in a DataFrame.
import pandas as pd

rows = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
df = pd.DataFrame(rows, columns=['A', 'B', 'C'])

# Mean of every column, for reference
print(df.mean())

# Mean of a single column
mean_b = df['B'].mean()
print("Mean of column B:", mean_b)

In [None]:
#3. Create a scatter plot using Matplotlib.
import matplotlib.pyplot as plt
x = [1,2,3,4,5]
y = [2,4,6,8,10]

plt.scatter(x,y)
plt.show()

In [None]:
# 4. How do you calculate the correlation matrix using Seaborn and visualize it with a heatmap?
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd

rows = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
df = pd.DataFrame(rows, columns=['A', 'B', 'C'])

# Pandas computes the correlation matrix; Seaborn only draws it
correlation_matrix = df.corr()
sns.heatmap(correlation_matrix, annot=True, cmap='RdBu_r')
plt.show()


In [None]:
# 5. Generate a bar plot using Plotly.
import plotly.graph_objects as go

fig = go.Figure()
fig.add_trace(go.Bar(x=[1, 2, 3, 4, 5], y=[3, 8, 5, 2, 6]))
fig.show()  # render the figure

In [None]:
# 6. Create a DataFrame and add a new column based on an existing column.
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
# Derive the new column from an existing one instead of hard-coding its values
df['C'] = df['B'] + 3
df

In [None]:
# 7. Write a program to perform element-wise multiplication of two NumPy arrays.
import numpy as np

arr1 = np.array([[1, 2, 3], [7, 8, 9]])
arr2 = np.array([[4, 5, 6], [6, 5, 2]])
arr3 = arr1 * arr2  # * multiplies matching elements; both arrays are 2x3
print(arr3)

In [None]:
# 8. Create a line plot with multiple lines using Matplotlib.
import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5]
y1 = [2, 4, 6, 8, 10]
y2 = [3, 6, 9, 12, 15]
y3 = [7, 2, 4, 5, 6]  # trimmed to the same length as x so it can be plotted

plt.plot(x, y1, label='Line 1')
plt.plot(x, y2, label='Line 2')
plt.plot(x, y3, label='Line 3')

plt.legend()
plt.xlabel("X-axis")
plt.ylabel("Y-axis")
plt.title("Multiple Line Plot")
plt.show()

In [None]:
#9. Generate a Pandas DataFrame and filter rows where a column value is greater than a threshold.
import pandas as pd
l =[[1,2,3],[4,5,6],[7,8,9]]
df = pd.DataFrame((l),  columns =['A','B','C'])
print(df)
df[df['B'] > 5]

In [None]:
#10. Create a histogram using Seaborn to visualize a distribution.
import seaborn as sns
import matplotlib.pyplot as plt
sns.histplot([1,2,3,4,5])
plt.show()

In [None]:
# 11. Perform matrix multiplication using NumPy.
import numpy as np

arr1 = np.array([[1, 2, 3], [7, 8, 9]])    # shape (2, 3)
arr2 = np.array([[4, 5], [6, 5], [2, 1]])  # shape (3, 2), so (2, 3) @ (3, 2) is valid
arr3 = arr1 @ arr2                         # result has shape (2, 2)
print(arr3)

In [None]:
# 12. Use Pandas to load a CSV file and display its first 5 rows.
import pandas as pd
df = pd.read_csv("taxonomy.csv")
df.head(5)


In [None]:
# 13. Create a 3D scatter plot using Plotly.
import plotly.express as px
import numpy as np

# Alternative: random data
# x = np.random.rand(100)
# y = np.random.rand(100)
# z = np.random.rand(100)

# The three coordinate lists must have the same length (13 points here)
fig = px.scatter_3d(
    x=[2, 4, 5, 6, 3, 4, 5, 6, 7, 9, 8, 4, 5],
    y=[8, 9, 4, 5, 2, 4, 5, 6, 3, 4, 5, 6, 7],
    z=[1, 2, 3, 2, 4, 5, 6, 3, 4, 5, 6, 7, 4],
)
fig.show()