An application to demonstrate how the Automotive Industry could use data analysis to take informed decisions.

piousannie/Engage

Car Data Analysis Project for Engage-

In this project, we study and analyse the given data extensively in order to draw significant correlations and understand the patterns and trends in the automotive market. The results point to a desirable set of specifications for products that are readily accepted in the market. This helps manufacturers understand the market better so that they can launch car models that optimise costs and maximise profit.

Technology Stack Used-

HTML5 CSS3 MySQL PowerBI Jupyter

Introduction-

HTML and CSS are used for the frontend of our web-based application. Jupyter Notebook is used at the backend to run the data analytics with Python libraries such as NumPy, Pandas, Matplotlib, Seaborn and Plotly, which perform Exploratory Data Analysis (EDA) and deliver useful graphs and insights. The report follows a mathematical approach, using k-means clustering to achieve the objective of identifying clusters of correlation. MySQL is used to query the data and identify relationships among the various variables. Python scripts are used for cleaning the data. PowerBI is used for interactive data visualization, which makes the data easier to analyse.

Challenge-

The automotive industry is one of the largest industries in the world: it is a 2.6 trillion dollar industry! India's automotive industry is worth more than USD 100 billion and contributes 8% of the country's total exports. The industry accounts for 2.3% of India's GDP and is set to become the 3rd largest in the world by 2025. It consists of many categories and subcategories, each shaped by so many variables that every category can be said to be an industry in itself.

For instance, car body type is a vital variable. Here is the list of car body types used in our data-

  1. SUV(Sport Utility Vehicle)
  2. Sedan
  3. Hatchback
  4. Coupe
  5. MPV(Multi-Purpose Vehicle)
  6. MUV(Multi Utility Vehicle)
  7. Convertible
  8. Crossover
  9. Pick-Up
  10. Sports

All of the variety above concerns car body type alone, which is only one variable! Similarly, 40+ car manufacturers can be identified in the given dataset. Not to mention the grey areas where some car body types can be irrelevant to the customer's decision.

Data-

The dataset used in this report contains 1200+ car models/variants described by 150+ features. The data is cleaned by running a series of Python scripts that remove units, irregularities, etc. For example, power and torque are equalised at 1000 rpm each for the sake of comparison, and the mode of each variable/feature is used to fill in empty cells where data was unavailable.

Additional data from the internet was used to support the analysis and to shape an approach to answering the queries.
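
The cleaning steps described above could look roughly like the following sketch. It is not the project's actual cleaning code: the column names, the sample values and the helper names `strip_units` and `fill_with_mode` are assumptions made for illustration.

```python
import pandas as pd

def strip_units(series: pd.Series) -> pd.Series:
    """Keep the first number found in strings such as '1,498 cc'."""
    text = series.astype(str).str.replace(",", "", regex=False)
    numbers = text.str.extract(r"(\d+(?:\.\d+)?)", expand=False)
    # Cells without any digits (for example missing values) become NaN.
    return pd.to_numeric(numbers, errors="coerce")

def fill_with_mode(series: pd.Series) -> pd.Series:
    """Replace missing cells with the most frequent value of the column."""
    modes = series.mode(dropna=True)
    return series if modes.empty else series.fillna(modes.iloc[0])

# Hypothetical raw rows, not taken from the dataset.
raw = pd.DataFrame({
    "Displacement": ["1197 cc", "1,498 cc", None, "1197 cc"],
    "Seating_Capacity": [5, None, 5, 7],
})

clean = raw.assign(
    Displacement=fill_with_mode(strip_units(raw["Displacement"])),
    Seating_Capacity=fill_with_mode(raw["Seating_Capacity"]),
)
print(clean)
```

Filling with the mode keeps categorical-like columns such as seating capacity on values that really occur in the data, which a mean would not guarantee.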

Exploratory Data Analysis-

Cars Count by Make-

image

Key Findings-

  1. The top 5 companies with the most car variants in India are Maruti Suzuki, Hyundai, Mahindra, Tata, and Toyota.
  2. Sports car variants are low.

Cars Count by Car Body Type-

image

Key Findings-

  1. The Top 3 body types in India are Hatchbacks, SUVs and Sedans.

Cars by Cylinders-

image

Key Findings-

  1. Most cars use 4 cylinders followed by 3 and 6 cylinders.

Cars by Seating Capacity-

image

Key Findings-

  1. Most cars are 5 seaters followed by 7 seaters.

Cars Count by Engine Fuel Type-

image

image

Key Findings-

  1. Most cars use Petrol and Diesel.

Cars count by Engine Size w.r.t Displacement-

image

Cars count by Engine Size w.r.t Power-

image

Relationship between Displacement and Price-

image

Key Findings-

  1. Displacement is positively related to price; the higher the price, the higher the displacement tends to be.

Relationship between power and price-

image

Key Findings-

  1. The horsepower of a car is related to its price.
  2. Hatchbacks are the body type with the least horsepower and price.

Relationship between price and mileage-

image

Key Findings-

  1. Expensive cars tend to have worse mileage and vice versa.

Checking Ex-Showroom Price distribution using normal and log scales due to the huge difference in prices-

image

Key Findings-

  1. There is a lot of variance in price, which can be checked by plotting a box plot.
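
To illustrate why a log scale helps when prices differ by orders of magnitude, here is a small sketch that bins the same values on a linear and on a log10 scale. The prices and the `histogram` helper are hypothetical and are not taken from the dataset or the notebook.

```python
import math
from collections import Counter

# Hypothetical ex-showroom prices in rupees (not taken from the dataset).
prices = [300_000, 450_000, 600_000, 800_000, 1_200_000,
          2_500_000, 9_000_000, 40_000_000, 200_000_000]

def histogram(values, n_bins):
    """Count values in n_bins equal-width bins between min and max."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    # Clamp the index so the maximum value lands in the last bin.
    bins = [min(int((v - lo) / width), n_bins - 1) for v in values]
    counts = Counter(bins)
    return [counts.get(i, 0) for i in range(n_bins)]

linear_counts = histogram(prices, 4)
log_counts = histogram([math.log10(p) for p in prices], 4)

print("linear bins:", linear_counts)  # nearly everything piles into the first bin
print("log10 bins: ", log_counts)     # the same cars spread across the bins
```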

Box Plot for Ex-Showroom price-

image

Key Findings-

  1. There are a lot of outliers.
  2. Outliers are mostly from the Sports and Coupe category as shown in the box plot below.
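
As a rough illustration of what a box plot marks as an outlier, the sketch below applies the common 1.5 x IQR whisker rule. The prices are hypothetical and `iqr_outliers` is an illustrative helper, not a function from the project.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return the values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical ex-showroom prices in lakh rupees; the last one mimics a sports car.
prices_lakh = [4, 5, 6, 7, 8, 9, 10, 12, 15, 250]
outliers = iqr_outliers(prices_lakh)
print(outliers)
```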

Box Plot of price vs. body type-

image

Key Findings-

  1. Car body type affects the price.

Plotting an extensive scatter plot grid of more numerical variables to investigate the relations in the data-

image

Key Findings-

  1. There exists multicollinearity between variables.

Plotting pairwise relationships as a pair plot comes in handy for Exploratory Data Analysis (EDA). Pair plots of the given data help find the relationships between variables, which can be continuous or categorical.

image

Key Findings-

  1. The graphs above show the relationship between Displacement and Price with respect to the Fuel Type.

Check for the overall correlation between variables using Pearson correlation matrix-

image

A correlation of -1.0 indicates a perfect negative correlation, and a correlation of 1.0 indicates a perfect positive correlation. If the correlation coefficient is greater than zero, it is a positive relationship. Conversely, if the value is less than zero, it is a negative relationship.
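
For reference, the Pearson coefficient behind such a matrix can be computed by hand as the covariance of two variables divided by the product of their standard deviations. The sketch below uses hypothetical values, not rows from the dataset.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of the same length (>= 2)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Hypothetical values for five cars.
displacement_cc = [800, 1000, 1200, 1500, 2000]
price_lakh = [3.0, 4.5, 6.0, 9.0, 15.0]
mileage_kmpl = [24, 22, 20, 17, 13]

print(pearson_r(displacement_cc, price_lakh))   # positive relationship
print(pearson_r(price_lakh, mileage_kmpl))      # negative relationship
```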

Key Findings-

  1. Price is positively related to Displacement
  2. Price is positively related to Cylinders
  3. Price is positively related to Power
  4. Price is positively related to Torque
  5. Displacement is positively related to Cylinders
  6. Displacement is positively related to Power
  7. Displacement is positively related to Torque
  8. Displacement is positively related to Fuel Tank
  9. Cylinders is positively related to Power
  10. Cylinders is positively related to Torque
  11. Cylinders is positively related to Fuel Tank
  12. Power is positively related to Torque
  13. Torque is positively related to Width
  14. Torque is positively related to Length
  15. Wheelbase is positively related to Power
  16. Wheelbase is positively related to Torque
  17. Wheelbase is positively related to Length
  18. Doors is positively related to Seating Capacity
  19. Fuel tank is positively related to Displacement.

Other Challenges-

As the previous figures show, clustering the market takes a lot of effort because the separation between clusters is not obvious. It is now clear that we have to look at many dimensions in order to cluster the automotive market, and the more features we explore, the harder it is to cluster. These dimensions affect buyers' decisions and are perceived very differently because buyers have different mental models. In other words, price, horsepower and mileage are not everything: some buyers would like a car with a long wheelbase, and some would like a wider car. All of these features, and more, strongly affect buyers' decisions.

This means that two cars can have a very similar price and mileage, yet one is a van with lots of space and the other is just a four-door sedan. These two cars are perceived as two different categories in the automotive industry, so space (the length, width and height of the car) can also be a vital factor. A three-dimensional representation won't tell everything, which is why we consider clustering, so that the many different features associated with each car can be used.
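
The van versus sedan argument can be made concrete with a small sketch: after standardising each feature, the distance between two cars grows once body dimensions are included alongside price and mileage. All four cars and their values are hypothetical, and standardising with z-scores is an assumption here rather than the project's documented preprocessing.

```python
import math

# Hypothetical cars: (price in lakh, mileage in km/l, length mm, width mm, height mm)
cars = {
    "van":       (9.0, 19.0, 4400, 1735, 1690),
    "sedan":     (9.2, 19.5, 3995, 1735, 1515),
    "hatchback": (6.0, 22.0, 3845, 1735, 1530),
    "suv":       (14.0, 16.0, 4300, 1790, 1635),
}

def standardize(rows):
    """Z-score every column so that no single unit dominates the distance."""
    columns = list(zip(*rows))
    means = [sum(col) / len(col) for col in columns]
    stds = [math.sqrt(sum((v - m) ** 2 for v in col) / len(col)) or 1.0
            for col, m in zip(columns, means)]
    return [tuple((v - m) / s for v, m, s in zip(row, means, stds)) for row in rows]

scaled = dict(zip(cars, standardize(list(cars.values()))))

def distance(a, b, n_features):
    """Euclidean distance between two cars using the first n_features columns."""
    return math.dist(scaled[a][:n_features], scaled[b][:n_features])

d_price_mileage = distance("van", "sedan", 2)
d_all_features = distance("van", "sedan", 5)
print(d_price_mileage, d_all_features)
```

On price and mileage alone the van and the sedan are nearly indistinguishable; the extra dimensions are what pull them apart.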

Graphs and conclusions for the most sold car models

Now we can check for the most popular car specifications and combinations from the models with the most units sold, which are stated in order as below-

  1. WagonR
  2. Swift
  3. Dzire
  4. Nexon
  5. Alto
  6. Ertiga
  7. Seltos
  8. Venue
  9. Eeco
  10. Punch
  11. Creta
  12. Vitara Brezza
  13. Celerio
  14. Sonet
  15. Grand i10
  16. Baleno
  17. i20
  18. S-Presso
  19. Amaze
  20. Tiago

Most sold cars count by car body type-

image

Key Findings-

  1. SUVs are the most sold car body type.

Most sold cars count by make-

image

Key Findings-

  1. Maruti Suzuki is the most sold car manufacturer.

Checking Ex-Showroom Price distribution using normal and log scales due to the huge difference in prices

image

Most sold cars count by engine fuel capacity-

image

Most sold cars count by power-

image

Most sold cars count by torque-

image

Most sold cars count by car type-

image

Key Findings-

  1. Manual cars are the most sold car type followed by automatic cars.

Most sold cars count by minimum turning radius-

image

Conclusion-

  1. 78.6% of the most sold cars had 4 cylinders
  2. BS 6 accounted for 67.93% of most sold cars
  3. 5-seaters account for 91.83% of most sold cars
  4. Maruti Suzuki is the most popular car manufacturer
  5. SUV accounted for 38.59% of most sold cars.
  6. Petrol cars accounted for 61.41% of most sold cars.
  7. 79% of most sold cars have 2 airbags.
  8. 68.63% of most sold cars have 5 gears.
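
Shares like the ones listed above can be obtained by counting how often each category occurs among the most sold variants. The sketch below shows the idea on made-up values; `share_by_category` is a hypothetical helper and the numbers are not the project's results.

```python
from collections import Counter

def share_by_category(values):
    """Percentage of rows per category, most common first."""
    if not values:
        raise ValueError("values must not be empty")
    counts = Counter(values)
    return {category: round(100 * count / len(values), 2)
            for category, count in counts.most_common()}

# Hypothetical cylinder counts for ten variants (not the real data).
cylinders = [4, 4, 3, 4, 4, 3, 4, 4, 4, 3]
shares = share_by_category(cylinders)
print(shares)
```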

Clustering

Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters). It is basically a collection of objects formed on the basis of the similarity and dissimilarity between them. It is a main task of exploratory data analysis and a common technique for statistical data analysis, used in many fields, including pattern recognition, image analysis, information retrieval, bioinformatics, data compression, computer graphics and machine learning.

The appropriate clustering algorithm and parameter settings (including parameters such as the distance function to use, a density threshold or the number of expected clusters) depend on the individual data set and intended use of the results. Cluster analysis as such is not an automatic task, but an iterative process of knowledge discovery or interactive multi-objective optimization that involves trial and failure. It is often necessary to modify data preprocessing and model parameters until the result achieves the desired properties.

image

The type of clustering we are using here is K-Means clustering. K-Means clustering is a method of vector quantization, originally from signal processing, that aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean (cluster center or cluster centroid), serving as a prototype of the cluster. This results in a partitioning of the data space into Voronoi cells. K-means clustering minimizes within-cluster variances (squared Euclidean distances).

Now we can revisit some scatter plots, this time with clusters added.
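
The assign-then-update loop behind K-Means (often called Lloyd's algorithm) can be sketched in plain Python as follows. This is an illustration of the technique, not the notebook's code: the points, the starting centroids and the function names are assumptions, and a real analysis would more likely rely on a library implementation.

```python
import math

def assign(points, centroids):
    """Assignment step: index of the nearest centroid for every point."""
    return [min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            for p in points]

def update(points, labels, centroids):
    """Update step: move each centroid to the mean of its assigned points."""
    moved = []
    for j, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == j]
        if not members:
            moved.append(old)  # an empty cluster keeps its previous centroid
        else:
            moved.append(tuple(sum(dim) / len(members) for dim in zip(*members)))
    return moved

def kmeans(points, initial_centroids, max_iter=100):
    """Alternate both steps until the centroids stop moving."""
    centroids = [tuple(c) for c in initial_centroids]
    labels = assign(points, centroids)
    for _ in range(max_iter):
        moved = update(points, labels, centroids)
        if moved == centroids:
            break
        centroids = moved
        labels = assign(points, centroids)
    return labels, centroids

# Hypothetical cars as (price in lakh, power in bhp).
points = [(5, 70), (6, 80), (7, 90), (50, 300), (55, 320), (60, 340)]
labels, centroids = kmeans(points, initial_centroids=[points[0], points[3]])
print(labels, centroids)
```

Each centroid ends up at the mean of its cluster, which is exactly the "nearest mean" prototype described above.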

Plotting a 3D scatter plot to check for power, mileage and the car manufacturer-

newplot

Plotting a 3D scatter plot to check for price, power and the mileage-

image

Average prices of each cluster are as follows-

image

Number of cars existing in each cluster-

image

Car body types in each cluster-

image

Price vs Power-

image

Key Findings-

  1. Clusters are strongly affected by the price with clear separation between clusters.
  2. No clusters can be separated for power.
  3. Expensive cars tend to have higher power and vice versa.

Price vs Displacement

image

Key Findings-

  1. Cluster separation can be performed on price.
  2. No clusters can be formed for displacement.
  3. Expensive cars tend to have higher displacement and vice versa.

Price vs Cylinders

image

Key Findings-

  1. Cluster separation can be performed on price.
  2. No clusters can be formed for cylinders.
  3. Expensive cars tend to have a higher number of cylinders and vice versa.

Price vs Torque

image

Key Findings-

  1. Cluster separation can be performed on price.
  2. No clusters can be formed for torque.
  3. Expensive cars tend to have higher torque and vice versa.

Cylinders vs Displacement

image

Key Findings-

  1. Clusters are too sparse and blurry to be identified hence no conclusions can be made.

Power vs Displacement

image

Key Findings-

  1. Clusters are too sparse and blurry to be identified hence no conclusions can be made.

Torque vs Displacement

image

Key Findings-

  1. Clusters are too sparse and blurry to be identified hence no conclusions can be made.

Fuel Tank vs Displacement

image

Key Findings-

  1. Clusters are too sparse and blurry to be identified hence no conclusions can be made.

Power vs Cylinders

image

Key Findings-

  1. Cluster separation can be performed on cylinders.
  2. No clusters can be formed for Power.
  3. High power cars tend to use a higher number of cylinders in them and vice-versa.

Torque vs Cylinders

image

Key Findings-

  1. Cluster separation can be performed on cylinders.
  2. No clusters can be formed for Torque.
  3. High torque cars tend to use a higher number of cylinders in them and vice-versa.

Fuel Tank Capacity vs Cylinders

image

Key Findings-

  1. Cluster separation can be performed on cylinders.
  2. No clusters can be formed for fuel tank capacity.
  3. Cars with a higher number of cylinders used in them tend to have a higher fuel tank capacity and vice-versa.

Power vs Mileage

image

Key Findings-

  1. Clusters are too close to each other and blurry to be identified hence no conclusions can be made.

Mileage vs ex-showroom price

image

Key Findings-

  1. Cluster separation can be performed on mileage.
  2. No clusters can be formed for ex-showroom price.
  3. Mileage decreases as price increases and vice-versa.

Power vs Fuel Tank-

image

Key Findings-

  1. Cluster separation can be performed on horsepower.
  2. No clusters can be formed for fuel tank capacity.
  3. Power increases as fuel tank capacity increases and vice-versa.

Why use Clustering?

Clustering takes many variables into consideration at once, which are hard to trace with other conventional methods. The clusters generated by the K-Means model can be used to identify strategic groups that compete strongly with the company's products in the market. It also shows the clusters close to such a group, which can also be taken into consideration in some cases.

Advantages of k-means include the following-

  1. Relatively simple to implement.
  2. Scales to large data sets.
  3. Guarantees convergence.
  4. Can warm-start the positions of centroids.
  5. Easily adapts to new examples.
  6. Can be generalized to clusters of different shapes and sizes, such as elliptical clusters.

Problems with clustering-

As tempting as it is to use clustering to produce strategic groups, it is worth mentioning that the clustering process itself is somewhat ambiguous, and the contribution of each feature to the clustering can't be easily explained, so the overall interpretability of the model is a challenge.

Disadvantages of k-means include the following-

  1. Choosing k manually- The value of k has to be picked by hand, and the optimal k is not known in advance.

  2. Being dependent on initial values- For a low k, you can mitigate this dependence by running k-means several times with different initial values and picking the best result. As k increases, you need advanced versions of k-means to pick better values of the initial centroids (called k-means seeding).

  3. Clustering data of varying sizes and density- k-means has trouble clustering data where clusters are of varying sizes and density.

  4. Clustering outliers- Centroids can be dragged by outliers, or outliers might get their own cluster instead of being ignored. Consider removing or clipping outliers before clustering.

  5. Scaling with number of dimensions- As the number of dimensions increases, a distance-based similarity measure converges to a constant value between any given examples. Reduce dimensionality either by using PCA on the feature data, or by using “spectral clustering” to modify the clustering algorithm.
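
The dependence on initial values can be seen in a small sketch: the same one-dimensional data converges to different results from different starting centroids, and keeping the run with the lowest within-cluster sum of squares (inertia) mitigates this. The prices, the starting points and the helper names `lloyd_1d` and `best_of` are hypothetical.

```python
def lloyd_1d(values, centroids, max_iter=100):
    """Plain k-means on 1-D data from the given starting centroids."""
    centroids = list(centroids)
    for _ in range(max_iter):
        groups = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)), key=lambda j: abs(v - centroids[j]))
            groups[nearest].append(v)
        # An empty group keeps its previous centroid.
        updated = [sum(g) / len(g) if g else c for g, c in zip(groups, centroids)]
        if updated == centroids:
            break
        centroids = updated
    inertia = sum(min((v - c) ** 2 for c in centroids) for v in values)
    return centroids, inertia

def best_of(values, starts):
    """Run k-means from every start and keep the lowest-inertia result."""
    return min((lloyd_1d(values, s) for s in starts), key=lambda result: result[1])

# Hypothetical prices in lakh rupees forming three visible groups.
prices = [4, 5, 14, 15, 104, 105, 106, 107]
poor_start = [4.5, 104.5, 106.5]   # two centroids begin inside the expensive group
good_start = [4, 14, 104]          # one centroid begins in each group

print(lloyd_1d(prices, poor_start))
print(best_of(prices, [poor_start, good_start]))
```

From the poor start the two cheaper groups stay merged, so the run settles in a worse local optimum; comparing inertia across runs is what exposes it.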

Conclusion-

Data Analysis is the process of systematically applying statistical and/or logical techniques to describe and illustrate, condense and recap, and evaluate data. The dataset used in this report contained 1200+ car models/variants described by 150+ features, on which Exploratory Data Analysis was performed to gain useful insights about the automotive industry in India. Clustering made it possible to take into consideration many variables that are hard to trace with other conventional methods. The clusters generated by the K-Means model can be used to answer the queries given in the problem statement. Clustering may not be decisive on its own, but it can augment management decisions when used alongside human intuition to form the right strategic groups.

This is the end of the report. Thank you!
