Car Data Analysis Project for Engage-

In this project, we study and analyse the given data in depth to find significant correlations and to understand patterns and trends in the automotive market. From these results we derive a set of specifications that a product should meet to be well accepted in the market. This helps manufacturers understand the market better, so that they can launch car models that optimise costs and maximise profit.

Technology Stack Used-

Introduction-

HTML and CSS are used on the frontend of our web-based application. Jupyter Notebook is used on the backend to generate the analytics with Python libraries such as NumPy, pandas, Matplotlib, Seaborn and Plotly, which carry out the Exploratory Data Analysis (EDA) and deliver useful graphs and insights. The report follows a mathematical approach, using k-means clustering to achieve the objective of identifying clusters of correlation. MySQL is used to query the data and identify relationships among the variables. Python scripts are used for cleaning the data. Power BI is used for interactive data visualization, which makes the data easier to analyse.

Challenge-

The automotive industry is one of the largest industries in the world, worth about 2.6 trillion dollars. India's automotive industry is worth more than USD 100 billion and contributes 8% of the country's total exports. The industry accounts for 2.3% of India's GDP and is set to become the 3rd largest in the world by 2025. It consists of many categories and subcategories, each shaped by so many variables that every category can be considered an industry in itself.

For instance, car body type is a vital variable. Here is a list of the car body types in our data-

SUV(Sport Utility Vehicle)
Sedan
Hatchback
Coupe
MPV(Multi-Purpose Vehicle)
MUV(Multi Utility Vehicle)
Convertible
Crossover
Pick-Up
Sports

All of the variety above concerns car body type alone, which is only one variable! Similarly, 40+ car manufacturers can be identified in the given dataset. There are also grey areas where the car body type may be irrelevant to the customer's decision.

Data-

The dataset used in this report contains 1200+ car models/variants, described by 150+ features. The data is cleaned by running a series of Python scripts that remove units, irregularities, etc. For example, power and torque are equalised at 1000 rpm each for the sake of comparison, and the mode of each variable/feature is used to fill in the empty cells where data was unavailable.
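The cleaning steps described above (removing units and filling empty cells with the mode) can be sketched as follows. This is a dependency-free illustration, not the project's actual scripts: the helper names `strip_units` and `fill_with_mode` and the sample cells are assumptions made for this example.

```python
import re
from statistics import mode


def strip_units(value):
    """Return the first number found in a cell such as '1197 cc', or None."""
    if value is None:
        return None
    # Drop thousands separators so '1,498 cc' is read as 1498.
    match = re.search(r"\d+(?:\.\d+)?", str(value).replace(",", ""))
    return float(match.group()) if match else None


def fill_with_mode(values):
    """Replace missing (None) entries with the most common known value."""
    known = [v for v in values if v is not None]
    if not known:
        return list(values)
    most_common = mode(known)
    return [most_common if v is None else v for v in values]


# Hypothetical raw cells, not taken from the project's dataset.
raw_displacement = ["1197 cc", "1,498 cc", None, "1197 cc", ""]
cleaned = fill_with_mode([strip_units(v) for v in raw_displacement])
print(cleaned)
```

Filling with the mode keeps the column within values that actually occur in the data, at the cost of slightly overstating how common the most frequent value is.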

Additional data from the internet was used to support the analysis and to work out an approach to the queries in the problem statement.

Exploratory Data Analysis-

Cars Count by Make-

Key Findings-

The five companies with the most car variants in India are Maruti Suzuki, Hyundai, Mahindra, Tata, and Toyota.
There are few sports car variants.

Cars Count by Car Body Type-

Key Findings-

The Top 3 body types in India are Hatchbacks, SUVs and Sedans.

Cars by Cylinders-

Key Findings-

Most cars use 4 cylinders followed by 3 and 6 cylinders.

Cars by Seating Capacity-

Key Findings-

Most cars are 5 seaters followed by 7 seaters.

Cars Count by Engine Fuel Type-

Key Findings-

Most cars use Petrol and Diesel.

Cars count by Engine Size w.r.t Displacement-

Cars count by Engine Size w.r.t Power-

Relationship between Displacement and Price-

Key Findings-

Displacement increases with price: the higher the price, the higher the displacement tends to be.

Relationship between power and price-

Key Findings-

The horsepower of a car is positively related to its price.
Hatchbacks are the body type with the least horsepower and price.

Relationship between price and mileage-

Key Findings-

Expensive cars tend to have worse mileage and vice versa.

Checking Ex-Showroom Price distribution using normal and log scales due to the huge difference in prices-

Key Findings-

There is a lot of variance in price, which can be checked by plotting a box plot.

Box Plot for Ex-Showroom price-

Key Findings-

There are a lot of outliers.
Outliers are mostly from the Sports and Coupe category as shown in the box plot below.
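A box plot conventionally draws points beyond 1.5 times the interquartile range (IQR) from the quartiles as outliers. The sketch below applies that rule directly, assuming the conventional 1.5 factor; the function name `iqr_outliers` and the price values are hypothetical and are not taken from the project's data.

```python
from statistics import quantiles


def iqr_outliers(values, whisker=1.5):
    """Return the values lying outside [Q1 - w*IQR, Q3 + w*IQR]."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - whisker * iqr, q3 + whisker * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical ex-showroom prices (in lakh rupees), for illustration only.
prices_lakh = [4.5, 5.2, 6.0, 7.1, 8.4, 9.9, 12.0, 15.5, 250.0]
print(iqr_outliers(prices_lakh))
```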

Box Plot of price vs. body type-

Key Findings-

Car body type affects the price.

Plotting an extensive scatter plot grid of more numerical variables to investigate the relations across more of the data-

Key Findings-

There exists multicollinearity between variables.

Plotting pairwise relationships as a pair plot comes in handy for Exploratory Data Analysis (EDA). A pair plot of the given data helps find the relationships between variables, which can be continuous or categorical.

Key Findings-

The graphs above show the relationship between Displacement and Price with respect to the Fuel Type.

Check for the overall correlation between variables using Pearson correlation matrix-

A correlation of -1.0 indicates a perfect negative correlation, and a correlation of 1.0 indicates a perfect positive correlation. If the correlation coefficient is greater than zero, it is a positive relationship. Conversely, if the value is less than zero, it is a negative relationship.
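To make the coefficient concrete, here is a minimal sketch of how a Pearson correlation can be computed by hand. The function name `pearson` and the sample columns are assumptions for illustration; the values are made up and do not come from the project's dataset.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # Sums of products of deviations from the mean.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Hypothetical columns, for illustration only.
displacement = [800, 1000, 1200, 1500, 2000]
price = [3.0, 4.5, 6.0, 9.0, 15.0]
mileage = [24, 22, 20, 17, 13]

print(pearson(displacement, price))    # close to +1: positive relationship
print(pearson(displacement, mileage))  # close to -1: negative relationship
```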

Key Findings-

Price is positively related to Displacement
Price is positively related to Cylinders
Price is positively related to Power
Price is positively related to Torque
Displacement is positively related to Cylinders
Displacement is positively related to Power
Displacement is positively related to Torque
Displacement is positively related to Fuel Tank
Cylinders is positively related to Power
Cylinders is positively related to Torque
Cylinders is positively related to Fuel Tank
Power is positively related to Torque
Torque is positively related to Width
Torque is positively related to Length
Wheelbase is positively related to Power
Wheelbase is positively related to Torque
Wheelbase is positively related to Length
Doors is positively related to Seating Capacity
Fuel tank is positively related to Displacement.

Other Challenges-

As the previous figures show, clustering the market takes a lot of effort because the separation between clusters is not obvious. It is now clear that we have to look at many dimensions in order to cluster the automotive market, even though clustering becomes harder as more features are explored. These dimensions affect buyers' decisions and are perceived very differently, because buyers have different mental models. In other words, price, horsepower and mileage are not everything: some buyers want a car with a long wheelbase, some want a wider car. All of these features, and more, strongly affect buyers' decisions.

This means that two cars can have a very similar price and mileage, yet one is a van with lots of space and the other is just a four-door sedan. These two cars are perceived as two different categories in the automotive industry, so space (the length, width and height of the car) can also be a vital factor. A three-dimensional representation therefore won't tell us everything, which is why we use clustering to take into account the many different features associated with each car.

Graphs and conclusions for the most sold car models

Now we can check for the most popular car specifications and combinations from the models with the most units sold, which are stated in order as below-

WagonR
Swift
Dzire
Nexon
Alto
Ertiga
Seltos
Venue
Eeco
Punch
Creta
Vitara Brezza
Celerio
Sonet
Grand i10
Baleno
i20
S-Presso
Amaze
Tiago

Most sold cars count by car body type-

Key Findings-

SUVs are the most sold car body type.

Most sold cars count by make-

Key Findings-

Maruti Suzuki is the most sold car manufacturer.

Checking Ex-Showroom Price distribution using normal and log scales due to the huge difference in prices

Most sold cars count by engine fuel capacity-

Most sold cars count by power-

Most sold cars count by torque-

Most sold cars count by car type-

Key Findings-

Manual cars are the most sold car type followed by automatic cars.

Most sold cars count by minimum turning radius-

Conclusion-

4-cylinder cars accounted for 78.6% of most sold cars.
BS 6 cars accounted for 67.93% of most sold cars.
5-seaters accounted for 91.83% of most sold cars.
Maruti Suzuki is the most popular car manufacturer.
SUVs accounted for 38.59% of most sold cars.
Petrol cars accounted for 61.41% of most sold cars.
79% of most sold cars have 2 airbags.
68.63% of most sold cars have 5 gears.
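Percentages like the ones above are the share of each category among the most sold variants. A minimal sketch of that calculation is shown here; the function name `category_shares` and the sample list are hypothetical, and the numbers it produces are unrelated to the figures reported in this section.

```python
from collections import Counter


def category_shares(values):
    """Map each category to its percentage share, most common first."""
    if not values:
        return {}
    total = len(values)
    counts = Counter(values)
    return {cat: round(100 * n / total, 2) for cat, n in counts.most_common()}


# Hypothetical body types of eight variants, for illustration only.
body_types = ["SUV", "Hatchback", "SUV", "Sedan",
              "Hatchback", "SUV", "MUV", "Hatchback"]
shares = category_shares(body_types)
print(shares)
```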

Clustering

Cluster analysis, or clustering, is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups. It is a main task of exploratory data analysis and a common technique for statistical data analysis, used in many fields, including pattern recognition, image analysis, information retrieval, bioinformatics, data compression, computer graphics and machine learning.

The appropriate clustering algorithm and parameter settings (including parameters such as the distance function to use, a density threshold or the number of expected clusters) depend on the individual data set and intended use of the results. Cluster analysis as such is not an automatic task, but an iterative process of knowledge discovery or interactive multi-objective optimization that involves trial and failure. It is often necessary to modify data preprocessing and model parameters until the result achieves the desired properties.

The type of clustering we are using here is K-Means clustering. K-Means clustering is a method of vector quantization, originally from signal processing, that aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean (cluster center or centroid), which serves as a prototype of the cluster. This results in a partitioning of the data space into Voronoi cells. K-means clustering minimizes within-cluster variances (squared Euclidean distances).
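The description above corresponds to the standard iterative procedure (Lloyd's algorithm): assign each point to its nearest centroid, then move each centroid to the mean of its points, and repeat until nothing changes. The following is a small dependency-free sketch of that procedure, not the code used in the notebooks; the function names, the seed parameter and the sample (price, power) pairs are all assumptions for illustration.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=100, seed=0):
    """Plain Lloyd's algorithm; returns (centroids, assignments)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points
    assignments = None
    for _ in range(iterations):
        # Assignment step: nearest centroid for every point.
        new_assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # converged: no point changed cluster
        assignments = new_assignments
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members)
                                     for dim in zip(*members))
    return centroids, assignments


# Hypothetical (price in lakh, power in PS) pairs, for illustration only.
cars = [(5.0, 70.0), (6.0, 80.0), (7.0, 85.0),
        (60.0, 300.0), (65.0, 320.0), (70.0, 340.0)]
centroids, labels = kmeans(cars, k=2)
print(labels)
```

In this sketch the features are used unscaled, so the feature with the largest numeric range dominates the distance. With real data, features are usually standardised before clustering.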

Now we can check some scatter plots but by adding clusters.

Plotting a 3D scatter plot to check for power, mileage and the car manufacturer-

Plotting a 3D scatter plot to check for price, power and the mileage-

Average prices of each cluster are as follows-

Number of cars existing in each cluster-

Car body types in each cluster-

Price vs Power-

Key Findings-

Clusters are strongly affected by the price, with clear separation between clusters.
No clusters can be separated for power.
Expensive cars tend to have higher power and vice versa.

Price vs Displacement

Key Findings-

Cluster separation can be performed on price.
No clusters can be formed for displacement.
Expensive cars tend to have higher displacement and vice versa.

Price vs Cylinders

Key Findings-

Cluster separation can be performed on price.
No clusters can be formed for cylinders.
Expensive cars tend to have a higher number of cylinders and vice versa.

Price vs Torque

Key Findings-

Cluster separation can be performed on price.
No clusters can be formed for torque.
Expensive cars tend to have higher torque and vice versa.

Cylinders vs Displacement

Key Findings-

Clusters are too sparse and blurry to be identified hence no conclusions can be made.

Power vs Displacement

Key Findings-

Clusters are too sparse and blurry to be identified hence no conclusions can be made.

Torque vs Displacement

Key Findings-

Clusters are too sparse and blurry to be identified hence no conclusions can be made.

Fuel Tank vs Displacement

Key Findings-

Clusters are too sparse and blurry to be identified hence no conclusions can be made.

Power vs Cylinders

Key Findings-

Cluster separation can be performed on cylinders.
No clusters can be formed for Power.
High power cars tend to use a higher number of cylinders in them and vice-versa.

Torque vs Cylinders

Key Findings-

Cluster separation can be performed on cylinders.
No clusters can be formed for Torque.
High torque cars tend to use a higher number of cylinders in them and vice-versa.

Fuel Tank Capacity vs Cylinders

Key Findings-

Cluster separation can be performed on cylinders.
No clusters can be formed for fuel tank capacity.
Cars with a higher number of cylinders used in them tend to have a higher fuel tank capacity and vice-versa.

Power vs Mileage

Key Findings-

Clusters are too close to each other and blurry to be identified hence no conclusions can be made.

Mileage vs ex-showroom price

Key Findings-

Cluster separation can be performed on mileage.
No clusters can be formed for ex-showroom price.
Mileage decreases as price increases and vice-versa.

Power vs Fuel Tank-

Key Findings-

Cluster separation can be performed on horsepower.
No clusters can be formed for fuel tank capacity.
Power increases as fuel tank capacity increases and vice-versa.

Why use Clustering?

Clustering takes many variables into account at once, which is hard to do with other conventional methods. The clusters generated by the K-Means model can be used to identify strategic groups that compete strongly with the company's products in the market. It also shows the clusters close to each group, which can be taken into consideration in some cases.

Advantages of k-means include the following-

Relatively simple to implement.
Scales to large data sets.
Guarantees convergence.
Can warm-start the positions of centroids.
Easily adapts to new examples.
Generalizes to clusters of different shapes and sizes, such as elliptical clusters.

Problems with clustering-

As tempting as it is to use clustering to produce strategic groups, it is worth mentioning that the clustering process itself is somewhat ambiguous, and the contribution of each feature to the result can't be easily explained, so the overall interpretability of the model is a challenge.

Disadvantages of k-means include the following-

Choosing k manually- The value of k has to be chosen by hand, and finding the optimal k takes experimentation.
Being dependent on initial values- For a low k, you can mitigate this dependence by running k-means several times with different initial values and picking the best result. As k increases, you need advanced versions of k-means to pick better values of the initial centroids (called k-means seeding).
Clustering data of varying sizes and density- k-means has trouble clustering data where clusters are of varying sizes and density.
Clustering outliers- Centroids can be dragged by outliers, or outliers might get their own cluster instead of being ignored. Consider removing or clipping outliers before clustering.
Scaling with number of dimensions- As the number of dimensions increases, a distance-based similarity measure converges to a constant value between any given examples. Reduce dimensionality either by using PCA on the feature data, or by using “spectral clustering” to modify the clustering algorithm.
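The last point, that distances converge as dimensions grow, can be observed with a small experiment. The sketch below is an illustration on uniformly random points, not an analysis of the car dataset: it measures how much farther the farthest point is than the nearest one, relative to the nearest. The function name `relative_contrast` and its parameters are assumptions for this example.

```python
import math
import random


def relative_contrast(dimensions, n_points=200, seed=0):
    """(farthest - nearest) / nearest distance from one random query point."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dimensions)]
    distances = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dimensions)]
        distances.append(math.dist(query, point))
    nearest, farthest = min(distances), max(distances)
    return (farthest - nearest) / nearest


# The contrast shrinks as the number of dimensions grows, so "near" and
# "far" become harder to tell apart for a distance-based method.
for d in (2, 10, 100, 1000):
    print(d, round(relative_contrast(d), 3))
```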

Conclusion-

Data Analysis is the process of systematically applying statistical and/or logical techniques to describe and illustrate, condense and recap, and evaluate data. The dataset used in this report contained 1200+ car models/variants described by 150+ features, on which Exploratory Data Analysis was performed to gain useful insights about the automotive industry in India. Clustering made it possible to take into account many variables at once, which are hard to trace with other conventional methods. The clusters generated by the K-Means model can be used to answer the queries given in the problem statement. Clustering may not be decisive on its own, but it can augment management decisions when used alongside human intuition to form the right strategic groups.
