In this project, we study the given data in depth to identify significant correlations and to understand the patterns and trends in the automotive market. From this analysis we derive a set of specifications that a product should meet to be well accepted in the market. This helps manufacturers understand the market better so that they can launch car models that optimise costs and maximise profit.
HTML and CSS are used for the frontend of our web-based application. Jupyter Notebook is used at the backend to generate the analytics, using Python libraries such as NumPy, Pandas, Matplotlib, Seaborn and Plotly to carry out Exploratory Data Analysis (EDA) and deliver useful graphs and insights. The report follows a mathematical approach, using k-means clustering to achieve the objective of identifying clusters of correlation. MySQL is used to query the data and identify relationships among the variables. Python scripts are used for cleaning the data. Power BI is used for interactive data visualization, which makes the analysis of data easier.
The automotive industry is one of the largest industries in the world, worth about 2.6 trillion dollars. India's automotive industry is worth more than USD 100 billion and contributes 8% of the country's total exports. It accounts for 2.3% of India's GDP and is set to become the 3rd largest in the world by 2025. The industry consists of many categories and subcategories, each shaped by so many variables that every category can be regarded as an industry in itself.
For instance, car body type is a vital variable. The car body types that appear in our data are listed below-
- SUV(Sport Utility Vehicle)
- Sedan
- Hatchback
- Coupe
- MPV(Multi-Purpose Vehicle)
- MUV(Multi Utility Vehicle)
- Convertible
- Crossover
- Pick-Up
- Sports
All of the variety above concerns only the car body type, which is just one variable! Similarly, 40+ car manufacturers can be identified in the given dataset. There are also grey areas where the car body type can be irrelevant to the customer's decision.
The dataset used in this report covers 1200+ car models/variants described by 150+ features. The data is cleaned by running a series of Python scripts that remove units, irregularities, etc. For example, power and torque are equalised at 1000 rpm each for the sake of comparison, and the mode of each variable/feature is used to fill in the empty cells where data was unavailable.
Additional data from the internet was used to support the analysis and to shape the approach to answering the queries.
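The two cleaning steps described above, removing units and filling empty cells with the mode, can be sketched with Pandas as follows. This is a minimal illustration and not the project's actual cleaning script: the column names, the sample values and the helper names `strip_units` and `fill_with_mode` are assumptions made for the example.

```python
import pandas as pd

# Hypothetical raw rows; column names and values are illustrative only.
raw = pd.DataFrame({
    "Displacement": ["1197 cc", "1498 cc", None, "1197 cc"],
    "Fuel_Tank_Capacity": ["37 litres", "45 litres", "37 litres", None],
    "Body_Type": ["Hatchback", "Sedan", None, "Hatchback"],
})

def strip_units(series: pd.Series) -> pd.Series:
    """Keep the first number found in each cell and drop the unit suffix."""
    number = series.str.extract(r"(\d+(?:\.\d+)?)", expand=False)
    return pd.to_numeric(number, errors="coerce")

def fill_with_mode(series: pd.Series) -> pd.Series:
    """Replace missing cells with the most frequent value of the column."""
    modes = series.mode(dropna=True)
    return series if modes.empty else series.fillna(modes.iloc[0])

clean = raw.copy()
for col in ["Displacement", "Fuel_Tank_Capacity"]:
    clean[col] = strip_units(clean[col])   # "1197 cc" becomes 1197.0
clean = clean.apply(fill_with_mode)        # column-wise mode imputation
```

Mode imputation works for both numeric and categorical columns, which is presumably why it suits a dataset with this many mixed features; for skewed numeric columns the median would be a common alternative.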
- The Top 5 companies with the most car variants in India are Maruti Suzuki, Hyundai, Mahindra, Tata, and Toyota.
- Sports car variants are low.
- The Top 3 body types in India are Hatchbacks, SUVs and Sedans.
- Most cars use 4 cylinders followed by 3 and 6 cylinders.
- Most cars are 5 seaters followed by 7 seaters.
- Most cars use Petrol and Diesel.
- Displacement increases with price; the higher the price, the higher the displacement.
- The horsepower of a car is related to its price.
- Hatchbacks are the body type with the least horsepower and price.
- Expensive cars tend to have worse mileage and vice versa.
Checking Ex-Showroom Price distribution using normal and log scales due to the huge difference in prices-
- There is a lot of variance in price that can be checked by plotting a box plot
- There are a lot of outliers.
- Outliers are mostly from the Sports and Coupe category as shown in the box plot below.
- Car body type affects the price.
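The log-scale view and the outliers seen in a box plot can be reproduced numerically. The sketch below assumes the usual box plot rule, where points beyond 1.5 times the interquartile range from the quartiles are drawn as outliers (this is Matplotlib's default whisker setting). The prices and the helper name `boxplot_outliers` are hypothetical and not taken from the dataset.

```python
import numpy as np

def boxplot_outliers(values):
    """Return a boolean mask of points outside the 1.5 * IQR whiskers."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return (values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)

# Hypothetical ex-showroom prices in rupees; not taken from the dataset.
prices = np.array([4.5e5, 6.0e5, 7.5e5, 9.0e5, 1.2e6, 1.5e6, 3.5e7])

log_prices = np.log10(prices)      # compresses the long right tail
mask = boxplot_outliers(prices)    # flags only the single extreme price
```

On the log scale the whole range spans less than two units, which is why a log axis makes a distribution with a few very expensive cars much easier to read.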
Plotting an extensive scatter plot grid of more numerical variables to investigate the relations in the data-
- There exists multicollinearity between variables.
Plotting pairwise relationships as a pair plot comes in handy for Exploratory Data Analysis (EDA). Pair plots of the given data help find the relationships between variables, which can be continuous or categorical.
- The graphs above show the relationship between Displacement and Price with respect to the Fuel Type.
A correlation of -1.0 indicates a perfect negative correlation, and a correlation of 1.0 indicates a perfect positive correlation. If the correlation coefficient is greater than zero, it is a positive relationship. Conversely, if the value is less than zero, it is a negative relationship.
- Price is positively related to Displacement
- Price is positively related to Cylinders
- Price is positively related to Power
- Price is positively related to Torque
- Displacement is positively related to Cylinders
- Displacement is positively related to Power
- Displacement is positively related to Torque
- Displacement is positively related to Fuel Tank
- Cylinders is positively related to Power
- Cylinders is positively related to Torque
- Cylinders is positively related to Fuel Tank
- Power is positively related to Torque
- Torque is positively related to Width
- Torque is positively related to Length
- Wheelbase is positively related to Power
- Wheelbase is positively related to Torque
- Wheelbase is positively related to Length
- Doors is positively related to Seating Capacity
- Fuel tank is positively related to Displacement.
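The sign of each relationship listed above comes from the correlation coefficient. As a minimal sketch, assuming the Pearson coefficient is the one used, it can be computed as below. The numbers are hypothetical values for five cars and `pearson_r` is an illustrative helper, not part of the project code.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()   # centre both variables
    return float((xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc)))

# Hypothetical values for five cars; not taken from the dataset.
displacement = [800, 1000, 1200, 1500, 2000]   # cc
price = [3.0, 4.5, 6.0, 9.0, 15.0]             # lakh rupees
mileage = [24, 22, 20, 17, 13]                 # km/l

r_price_disp = pearson_r(price, displacement)   # above zero: positive relation
r_price_mileage = pearson_r(price, mileage)     # below zero: negative relation
```

For a whole table of numeric features, Pandas' `DataFrame.corr()` returns the same coefficient for every pair of columns at once.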
As the previous figures show, clustering the market takes a lot of effort because the separation between clusters is not obvious. It is now clear that we have to look at many dimensions in order to cluster the automotive market, although the more features we explore, the harder it is to cluster. These dimensions affect buyers' decisions and are perceived very differently because buyers have different mental models. In other words, price, horsepower and mileage are not everything: some buyers would like a car with a long wheelbase, others a wider car. All of these features, and more, strongly affect buyers' decisions.
This means that two cars can have a very similar price and mileage, but one is a van with lots of space and the other is just a four-door sedan. These two cars are perceived as two different categories in the automotive industry, so space (the length, width and height of the car) can also be a vital factor. A three-dimensional representation therefore won't tell everything, which is why we use clustering to take into account the many different features associated with each car.
Now we can check the most popular car specifications and combinations among the models with the most units sold, which are listed in order below-
- WagonR
- Swift
- Dzire
- Nexon
- Alto
- Ertiga
- Seltos
- Venue
- Eeco
- Punch
- Creta
- Vitara Brezza
- Celerio
- Sonet
- Grand i10
- Baleno
- i20
- S-Presso
- Amaze
- Tiago
- SUVs are the most sold car body type.
- Maruti Suzuki is the most sold car manufacturer.
- Manual cars are the most sold car type followed by automatic cars.
- 78.6% of the most sold cars had 4 cylinders
- BS 6 accounted for 67.93% of most sold cars
- 5 seaters account for 91.83% of most sold cars
- Maruti Suzuki is the most popular car manufacturer
- SUV accounted for 38.59% of most sold cars.
- Petrol cars accounted for 61.41% of most sold cars.
- 79% of most sold cars have 2 airbags.
- 68.63% of most sold cars have 5 gears.
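Percentages like the ones above are shares of a category among the most sold cars. A minimal sketch of that arithmetic is shown below; the list of fuel types is hypothetical and `shares` is an illustrative helper name.

```python
from collections import Counter

def shares(values):
    """Percentage share of each distinct value, most common first."""
    counts = Counter(values)
    total = sum(counts.values())
    return {key: round(100 * n / total, 2) for key, n in counts.most_common()}

# Hypothetical fuel types of four sold variants; not taken from the dataset.
fuel_shares = shares(["Petrol", "Petrol", "Diesel", "CNG"])
```

With Pandas, the same shares can be obtained from `Series.value_counts(normalize=True)`.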
Cluster analysis, or clustering, is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups. In other words, it organises objects on the basis of the similarity and dissimilarity between them. It is a main task of exploratory data analysis and a common technique for statistical data analysis, used in many fields, including pattern recognition, image analysis, information retrieval, bioinformatics, data compression, computer graphics and machine learning.
The appropriate clustering algorithm and parameter settings (including parameters such as the distance function to use, a density threshold or the number of expected clusters) depend on the individual data set and intended use of the results. Cluster analysis as such is not an automatic task, but an iterative process of knowledge discovery or interactive multi-objective optimization that involves trial and failure. It is often necessary to modify data preprocessing and model parameters until the result achieves the desired properties.
The type of clustering we use here is K-Means clustering. K-Means clustering is a method of vector quantization, originally from signal processing, that aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean (the cluster center or centroid), which serves as a prototype of the cluster. This results in a partitioning of the data space into Voronoi cells. K-means clustering minimizes within-cluster variances (squared Euclidean distances).
Now we can look at some scatter plots again, this time with the clusters added.
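The procedure described above is commonly implemented as Lloyd's algorithm: assign every observation to its nearest centroid, move each centroid to the mean of its observations, and repeat until nothing changes. The sketch below is a minimal NumPy illustration of that idea, not the model used in the report. The function name `kmeans`, the toy points and the choice of starting centroids are all assumptions made for the example.

```python
import numpy as np

def kmeans(points, init_centroids, n_iter=100):
    """Lloyd's algorithm starting from the given initial centroids."""
    points = np.asarray(points, dtype=float)
    centroids = np.asarray(init_centroids, dtype=float).copy()
    k = len(centroids)

    def assign(c):
        # Squared Euclidean distance from every point to every centroid.
        d = ((points[:, None, :] - c[None, :, :]) ** 2).sum(axis=2)
        return d.argmin(axis=1)

    for _ in range(n_iter):
        labels = assign(centroids)
        # Move each centroid to the mean of its points (keep it if empty).
        updated = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(updated, centroids):
            break
        centroids = updated

    labels = assign(centroids)
    # Within-cluster sum of squared distances, the quantity k-means minimises.
    inertia = float(((points - centroids[labels]) ** 2).sum())
    return labels, centroids, inertia

# Hypothetical 2-D observations forming two groups; not taken from the dataset.
points = np.array([[1.0, 1.0], [1.2, 0.8], [0.8, 1.1],
                   [8.0, 8.0], [8.2, 7.9], [7.9, 8.1]])
labels, centroids, inertia = kmeans(points, init_centroids=points[[0, 3]])
```

Here the starting centroids are picked by hand so that the example is deterministic. In practice they are usually drawn at random from the observations, so the result can differ from run to run.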
- Clusters are strongly affected by the price with clear separation between clusters.
- No clusters can be separated for displacement.
- Expensive cars tend to have higher power and vice versa.
- Cluster separation can be performed on price.
- No clusters can be formed for displacement.
- Expensive cars tend to have higher displacement and vice versa.
- Cluster separation can be performed on price.
- No clusters can be formed for cylinders.
- Expensive cars tend to have a higher number of cylinders and vice versa.
- Cluster separation can be performed on price.
- No clusters can be formed for torque.
- Expensive cars tend to have higher torque and vice versa.
- Clusters are too sparse and blurry to be identified hence no conclusions can be made.
- Clusters are too sparse and blurry to be identified hence no conclusions can be made.
- Clusters are too sparse and blurry to be identified hence no conclusions can be made.
- Clusters are too sparse and blurry to be identified hence no conclusions can be made.
- Cluster separation can be performed on cylinders.
- No clusters can be formed for Power.
- High power cars tend to use a higher number of cylinders in them and vice-versa.
- Cluster separation can be performed on cylinders.
- No clusters can be formed for Torque.
- High torque cars tend to use a higher number of cylinders in them and vice-versa.
- Cluster separation can be performed on cylinders.
- No clusters can be formed for fuel tank capacity.
- Cars with a higher number of cylinders used in them tend to have a higher fuel tank capacity and vice-versa.
- Clusters are too close to each other and blurry to be identified hence no conclusions can be made.
- Cluster separation can be performed on mileage.
- No clusters can be formed for ex-showroom price.
- Mileage decreases as price increases and vice-versa.
- Cluster separation can be performed on horsepower.
- No clusters can be formed for fuel tank capacity.
- Power increases as fuel tank capacity increases and vice-versa.
Clustering takes into consideration many variables that are hard to trace with other, ordinary methods. The clusters generated by the K-Means model can be used to identify strategic groups that compete strongly with the company's products in the market. The model also shows which clusters lie close to a given group, and these can be taken into consideration in some cases.
- Relatively simple to implement.
- Scales to large data sets.
- Guarantees convergence (to a local optimum, not necessarily the global one).
- Can warm-start the positions of centroids.
- Easily adapts to new examples.
- Generalizes to clusters of different shapes and sizes, such as elliptical clusters.
Tempting as it is to use clustering to produce strategic groups, it is worth mentioning that the clustering process itself is somewhat ambiguous, and the contribution of each feature to the clustering cannot be easily explained, so the overall interpretability of the model is a challenge.
- Choosing the value of k manually to find the optimal k.
- Being dependent on initial values- For a low k, you can mitigate this dependence by running k-means several times with different initial values and picking the best result. As k increases, you need advanced versions of k-means to pick better values of the initial centroids (called k-means seeding).
- Clustering data of varying sizes and density- k-means has trouble clustering data where clusters are of varying sizes and density.
- Clustering outliers- Centroids can be dragged by outliers, or outliers might get their own cluster instead of being ignored. Consider removing or clipping outliers before clustering.
- Scaling with number of dimensions- As the number of dimensions increases, a distance-based similarity measure converges to a constant value between any given examples. Reduce dimensionality either by using PCA on the feature data, or by using “spectral clustering” to modify the clustering algorithm.
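The PCA step mentioned in the last point can be sketched with NumPy's singular value decomposition. This is a minimal illustration under stated assumptions rather than the report's pipeline: the features are standardised first (a common choice that the report does not confirm), no column is constant, and the values and the helper name `pca_reduce` are hypothetical.

```python
import numpy as np

def pca_reduce(features, n_components):
    """Project standardised features onto their top principal components."""
    X = np.asarray(features, dtype=float)
    X = (X - X.mean(axis=0)) / X.std(axis=0)   # z-score; assumes no constant column
    # Rows of vt are the principal directions, ordered by explained variance.
    _, s, vt = np.linalg.svd(X, full_matrices=False)
    explained = (s ** 2) / (s ** 2).sum()
    return X @ vt[:n_components].T, explained[:n_components]

# Hypothetical displacement (cc), power (hp) and torque (Nm) for five cars.
features = np.array([
    [800, 40, 60],
    [1000, 60, 90],
    [1200, 80, 110],
    [1500, 100, 150],
    [2000, 150, 220],
])
reduced, explained = pca_reduce(features, n_components=2)
```

Because these three features rise and fall together, almost all of the variance ends up in the first component, so the clustering can run on fewer dimensions with little loss of information.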
Data Analysis is the process of systematically applying statistical and/or logical techniques to describe and illustrate, condense and recap, and evaluate data. The dataset used in this report covered 1200+ car models/variants described by 150+ features, on which Exploratory Data Analysis was performed to gain useful insights about the automotive industry in India. Clustering made it possible to take into consideration many variables that are hard to trace with other, ordinary methods. The clusters generated by the K-Means model can be used to answer the queries given in the problem statement. Clustering may not be decisive on its own, but it can augment management decisions when used alongside human intuition to form the right strategic groups.