# Preparing for a data science (DS) or ML engineer job interview

## Top 10 pieces of advice for preparing for a data science interview

1. Study the job description and requirements carefully, and make sure you understand the company's needs and goals.

2. Review the fundamentals of statistics, linear algebra, calculus, and probability theory. Make sure you have a solid understanding of these subjects.

3. Practice coding in the programming languages that are required for the job, such as Python, R, or SQL. Focus on building proficiency in these languages by working on sample projects and exercises.

4. Develop a strong foundation in data structures and algorithms. These are important for solving complex data science problems.

5. Study machine learning algorithms and their applications. This will help you understand the basics of supervised and unsupervised learning, regression, clustering, and other techniques.

6. Work on projects to demonstrate your skills and expertise. This can include building predictive models, visualizing data, or working with real-world datasets.

7. Practice explaining complex technical concepts in simple terms. Being able to communicate technical concepts to non-technical stakeholders is an important skill for a data scientist.

8. Stay up to date with the latest industry trends and advancements in data science. This can include attending conferences, reading industry publications, and following experts on social media.

9. Prepare for common data science interview questions, such as those related to hypothesis testing, data wrangling, and model selection.

10. Be confident, enthusiastic, and personable. Show your passion for data science and your willingness to learn and grow in the field. Remember, the interview is also an opportunity for you to learn more about the company and its culture.

## 10 things you should know


1. Know how to write code on a whiteboard or paper
2. Know basic Python control flow: how to use loops and if-else statements
3. Be able to discuss how you've used Python
4. Know how to solve common interview problems
5. Know the basic Python data types, when to use each one, and how to iterate over it
6. Know how to use list comprehensions
7. Know how to use generators
8. Know the basics of OOP: how to write a general template for a class and how to inherit from a class
9. Have Python-related questions ready to ask your interviewer. They can steer the conversation toward areas where you are strong. Be ready to answer the follow-up questions that your own question prompts.
10. Know the basics of related technologies (SQL, command-line navigation, git). Check the job description to see which ones matter for the role.
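
For the OOP item above, a minimal sketch of a class template and a subclass may help as a starting point. The names `Animal`, `Dog`, and `speak` are made up for illustration.

```python
class Animal:
    """Base class: a general template with a constructor and one method."""

    def __init__(self, name):
        self.name = name

    def speak(self):
        return f"{self.name} makes a sound"


class Dog(Animal):
    """Subclass: inherits __init__ and overrides speak()."""

    def speak(self):
        # super() reaches the parent implementation when it is still useful
        base = super().speak()
        return f"{base}: Woof"


dog = Dog("Rex")
message = dog.speak()
print(message)
```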

## Common interview questions

### Python


1. What is Python?

        Python is a high-level programming language that is interpreted and dynamically typed. It is widely used for various applications, including web development, data analysis, and artificial intelligence.



2. What are the advantages of using Python over other programming languages?
   
        Python has several advantages, including its simplicity, readability, extensive libraries, platform independence, and a large community of developers for support.



3. What is PEP8, and why is it important?

        PEP 8 (often written PEP8) is the style guide for Python code. It promotes consistency and readability, which makes code easier to maintain and makes collaboration among developers smoother.



4. What is a Python module?
   
       A Python module is a file containing Python code that can be imported and used in other Python programs. It allows code organization and reusability.



5. What is the difference between a list and a tuple in Python?
   
       In Python, a list is mutable, meaning its elements can be changed, added, or removed. In contrast, a tuple is immutable, and its elements cannot be modified once defined.
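
A short sketch of the difference, using made-up variable names. It also shows one practical consequence of immutability: a tuple of hashable items can be used as a dictionary key.

```python
point_list = [1, 2]
point_list.append(3)          # lists can grow in place
point_list[0] = 10            # and elements can be reassigned

point_tuple = (1, 2)
try:
    point_tuple[0] = 10       # tuples do not support item assignment
except TypeError as exc:
    error = str(exc)

# Tuples of hashable items are hashable, so they can serve as dict keys
distances = {(0, 0): 0.0, (3, 4): 5.0}
```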



6. What are decorators in Python?
   
       Decorators are functions that modify the behavior of other functions or classes. They are denoted with the '@' symbol and can add functionality, such as logging or authentication, to existing code.
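
As an illustration, here is a minimal sketch of a logging-style decorator. The names `log_calls` and `add` are hypothetical, and storing the call history on the wrapper is just one simple way to make the effect visible.

```python
import functools


def log_calls(func):
    """Decorator that records each call before delegating to func."""

    @functools.wraps(func)  # keep the wrapped function's name and docstring
    def wrapper(*args, **kwargs):
        wrapper.calls.append((args, kwargs))
        return func(*args, **kwargs)

    wrapper.calls = []
    return wrapper


@log_calls
def add(a, b):
    return a + b


result = add(2, b=3)
```

Writing `@log_calls` above `add` is equivalent to `add = log_calls(add)`.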



7. What is the purpose of __init__.py in Python?

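       '__init__.py' is a special file that marks a directory as a regular Python package. It can contain initialization code that runs when the package is first imported, and it can define which names the package exposes. Since Python 3.3, a directory without this file can still be imported as a namespace package.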



8. What is the difference between a generator and a list in Python?
   
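       In Python, a generator is an iterator that produces values lazily, one at a time, so it does not hold the whole sequence in memory and can be consumed only once. A list, on the other hand, stores all of its elements in memory and can be indexed and iterated repeatedly. Generators are typically used when working with large datasets or streams.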
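
The memory difference and the single-pass behaviour can be checked with a small sketch. The sizes reported by `sys.getsizeof` depend on the interpreter, so only the comparison matters here; `countdown` is a made-up example function.

```python
import sys

squares_list = [n * n for n in range(100_000)]   # all values stored at once
squares_gen = (n * n for n in range(100_000))    # values produced on demand

list_size = sys.getsizeof(squares_list)
gen_size = sys.getsizeof(squares_gen)


def countdown(start):
    """Generator function: yields start, start - 1, ..., 1 lazily."""
    while start > 0:
        yield start
        start -= 1


first_pass = list(countdown(3))
gen = countdown(3)
consumed = list(gen)
second_pass = list(gen)   # a generator is exhausted after one pass
```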



9. What are the advantages of using virtual environments in Python?
   
       Virtual environments enable the isolation of Python environments, allowing developers to manage different packages and dependencies for different projects. They prevent conflicts and ensure consistent environments.

10. How do you handle exceptions in Python?
    
        Exceptions in Python can be handled using try-except blocks. The code within the try block is executed, and if an exception occurs, it is caught and handled in the corresponding except block.
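
A minimal sketch of the full `try` statement, including the optional `else` and `finally` clauses. The function name `safe_divide` and the `log` list are illustrative only.

```python
log = []


def safe_divide(a, b):
    """Return a / b, or None when b is zero."""
    try:
        result = a / b
    except ZeroDivisionError:
        # runs only if the try block raised this specific exception
        return None
    else:
        # runs only if no exception was raised
        return result
    finally:
        # runs in every case, typically used for cleanup
        log.append((a, b))


ok = safe_divide(6, 3)
failed = safe_divide(1, 0)
```

Catching a specific exception type, rather than using a bare `except:`, keeps unrelated errors visible.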



11. How do you perform file I/O operations in Python?
    
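        File I/O (input/output) in Python starts with the built-in 'open()' function, which returns a file object. Reading and writing are done with that object's methods, such as 'read()', 'write()', and 'close()'. Opening the file in a 'with' statement closes it automatically, even if an error occurs.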
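
A small sketch of writing and then reading a text file with the `with` statement. The file name `notes.txt` is hypothetical, and a temporary directory is used so the example does not touch real files.

```python
import os
import tempfile

# A temporary directory keeps the example from touching real files
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "notes.txt")   # hypothetical file name

    with open(path, "w", encoding="utf-8") as f:
        f.write("first line\n")
        f.write("second line\n")

    with open(path, "r", encoding="utf-8") as f:
        lines = [line.rstrip("\n") for line in f]   # iterate line by line

    file_closed = f.closed   # the with block has already closed the file
```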



12. What is the difference between a shallow copy and a deep copy in Python?
    
        A shallow copy creates a new container object, but its elements are references to the same nested objects as in the original. A deep copy recursively copies the nested objects as well, so the copy shares no mutable state with the original.
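
The difference only shows up with nested mutable objects, as in this short sketch based on the standard `copy` module:

```python
import copy

original = [[1, 2], [3, 4]]
shallow = copy.copy(original)       # new outer list, same inner lists
deep = copy.deepcopy(original)      # new outer list and new inner lists

original[0].append(99)              # mutate a nested list in place

shares_inner = shallow[0] is original[0]
```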



13. What is the difference between `==` and `is` in Python?
    
        '==' compares the values of two objects, checking if they are equal. 'is' checks if two objects are the same instance in memory, i.e., they occupy the same memory location.
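
A minimal sketch with lists, where equality and identity clearly differ:

```python
a = [1, 2, 3]
b = [1, 2, 3]
c = a

equal_values = a == b      # same contents
same_object = a is b       # two separate list objects
alias = a is c             # c is another name for the same object

# `is` is the idiomatic way to test for None
value = None
is_missing = value is None
```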



14. What is the difference between a class method and an instance method in Python?
    
        A class method is declared with the '@classmethod' decorator, receives the class itself ('cls') as its first argument, and operates on class-level data. An instance method receives the instance ('self') as its first argument and operates on instance-specific data.
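
A sketch of both kinds of method on a made-up `Counter` class. Using a class method as an alternative constructor is one common pattern, not the only use.

```python
class Counter:
    instances = 0                      # class-level data shared by all instances

    def __init__(self, start=0):
        self.value = start             # instance-specific data
        Counter.instances += 1

    def increment(self):               # instance method: receives self
        self.value += 1
        return self.value

    @classmethod
    def from_string(cls, text):        # class method: receives cls
        return cls(int(text))          # common use: alternative constructor


c1 = Counter()
c2 = Counter.from_string("41")
first = c1.increment()
second = c2.increment()
```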



15. How do you create a dictionary in Python?
    
        A dictionary in Python is created using curly braces {} or the built-in 'dict()' function. Key-value pairs are enclosed within the braces, with a colon separating each key from its corresponding value.



16. What is the difference between `map` and `filter` functions in Python?
   
           The 'map' function in Python applies a given function to each item in an iterable and returns an iterator over the results. It transforms the elements of the iterable based on the provided function.
           The 'filter' function, on the other hand, applies a predicate function to each item in an iterable and returns an iterator over only the elements for which the predicate evaluates to True. It filters out elements that do not satisfy the given condition.
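
A short sketch of both functions on the same input. In Python 3 both return lazy iterators, which is why the results are wrapped in `list()`.

```python
numbers = [1, 2, 3, 4, 5, 6]

squared = list(map(lambda n: n * n, numbers))        # transform every item
evens = list(filter(lambda n: n % 2 == 0, numbers))  # keep items passing the test

# The two can be chained: square only the even numbers
even_squares = list(map(lambda n: n * n, filter(lambda n: n % 2 == 0, numbers)))
```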

17. What is the difference between a set and a frozenset in Python?
       
           A set in Python is an unordered collection of unique elements, where duplicates are automatically removed. Sets are mutable, allowing addition and removal of elements.
           A frozenset, on the other hand, is an immutable version of a set. Once created, its elements cannot be modified. Frozensets are useful when you need an immutable set.



18. What is the use of *args and **kwargs in Python?

           '*args' collects any extra positional arguments passed to a function into a tuple, so the function can accept any number of them without naming each one in its signature.
           '**kwargs' collects any extra keyword arguments into a dictionary in the same way, so the function can accept any number of named arguments.
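
A minimal sketch showing what a function actually receives, plus the reverse use of `*` and `**` to unpack arguments at the call site. The function name `describe` is illustrative.

```python
def describe(*args, **kwargs):
    """Return the positional and keyword arguments exactly as received."""
    return args, kwargs


positional, named = describe(1, 2, color="red", size=3)

# The same symbols unpack a sequence or a dict at the call site
values = (1, 2)
options = {"color": "red", "size": 3}
unpacked = describe(*values, **options)
```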

19. How do you create a class in Python?

           To create a class in Python, you use the 'class' keyword followed by the class name. You can define attributes and methods within the class to define its properties and behaviors.



20. What is a lambda function in Python?
   
           A lambda function is an anonymous function in Python. It is defined using the 'lambda' keyword and is typically used for simple, one-line functions without a formal function definition. Lambda functions are often used in conjunction with higher-order functions like 'map' and 'filter'.



21. What are the advantages of using list comprehensions in Python?
   
           List comprehensions in Python provide a concise and readable way to create lists based on existing lists or other iterables. They can simplify the code and make it more expressive by combining looping and conditional statements in a single line.
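
To make the comparison concrete, here is the same filtering and transformation written as a loop and as a comprehension. The word list is made up.

```python
words = ["data", "science", "ml", "python", "sql"]

# Loop version
long_upper_loop = []
for w in words:
    if len(w) > 3:
        long_upper_loop.append(w.upper())

# Equivalent list comprehension: expression, loop, and condition on one line
long_upper = [w.upper() for w in words if len(w) > 3]

# The same syntax builds dictionaries
lengths = {w: len(w) for w in words}
```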
           
           

22. How do you import modules in Python?

           Modules can be imported in Python using the 'import' statement. You can import the entire module or specific functions/classes from the module. Additionally, you can use 'as' to assign an alias to the module or its components.



23. What is the difference between a local and a global variable in Python?

           A local variable is defined inside a function and is accessible only within that function. Note that blocks such as 'if' and 'for' do not create a new scope in Python.
           A global variable, on the other hand, is defined at module level, outside of any function. It can be read from any function in that module, but assigning to it inside a function requires the 'global' statement; otherwise the assignment creates a new local variable.



24. What is the use of `__name__` attribute in Python?

           The '__name__' attribute in Python is a special attribute that is automatically defined for every module. When the module is run directly, '__name__' is set to the string '__main__'; when it is imported, '__name__' is set to the module's own name. Checking whether '__name__' equals '__main__' therefore lets a file run code only when it is executed as a script and not when it is imported.



25. What is the difference between a private and a protected method in Python?
   
           In Python, "private" and "protected" methods are naming conventions that signal how a method is meant to be used. The language does not strictly enforce either one.

        A private method is denoted by a double underscore prefix before the method name, such as `__method()`. Python applies name mangling to such names: inside the class `MyClass`, `__method` is stored as `_MyClass__method`. Calling `obj.__method()` from outside the class therefore raises an AttributeError, although the method is still reachable under its mangled name. This helps prevent accidental access and accidental overriding in subclasses.

        A protected method is denoted by a single underscore prefix before the method name, such as `_method()`. This is purely a convention: it tells other developers that the method is intended for use within the class and its subclasses, but Python does not block access from outside the class hierarchy.

        Here is an example to illustrate the difference between private and protected methods:

```python
class MyClass:
    def __init__(self):
        self.__private_method()  # inside the class the name is mangled automatically
        self._protected_method()

    def __private_method(self):
        print("This is a private method")

    def _protected_method(self):
        print("This is a protected method")


class MySubclass(MyClass):
    def __init__(self):
        super().__init__()

    def access_protected_method(self):
        self._protected_method()


obj = MyClass()  # Output: "This is a private method" and "This is a protected method"

try:
    obj.__private_method()  # no name mangling outside the class, so the lookup fails
except AttributeError as exc:
    print(f"AttributeError: {exc}")

obj._MyClass__private_method()  # the mangled name still works: privacy is not enforced
obj._protected_method()  # Can be accessed from outside the class but it's not recommended

sub_obj = MySubclass()  # Output: "This is a private method" and "This is a protected method"
sub_obj.access_protected_method()  # Output: "This is a protected method"
```

    In the example above, `__private_method()` can be called by its plain name only within the `MyClass` definition because of name mangling, while `_protected_method()` can be called from anywhere, including subclasses of `MyClass`, although calling it from outside the class hierarchy is discouraged by convention.
   
   
26. How do you create a thread in Python?

        Threads can be created in Python by importing the `threading` module, creating a `Thread` object with the target function as an argument, and calling its `start()` method. Calling `join()` waits for the thread to finish. Here is an example:

```python
import threading

def worker():
    print('Thread started')

# Create the thread, start it, and wait for it to finish
t = threading.Thread(target=worker)
t.start()
t.join()
```

27. What is the difference between a stack and a queue in Python?

        A stack and a queue are both data structures in Python, but they have different ways of organizing data. A stack follows the Last-In-First-Out (LIFO) principle, which means that the last element added to the stack is the first element to be removed. On the other hand, a queue follows the First-In-First-Out (FIFO) principle, which means that the first element added to the queue is the first element to be removed.
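
One common way to get both behaviours with the standard library is sketched below: a plain list as a stack, and `collections.deque` as a queue, since popping from the front of a list is slow for large lists.

```python
from collections import deque

# Stack (LIFO): a list, pushing and popping at the end
stack = []
stack.append("a")
stack.append("b")
stack.append("c")
last_in = stack.pop()

# Queue (FIFO): a deque, appending at the right and popping from the left
queue = deque()
queue.append("a")
queue.append("b")
queue.append("c")
first_in = queue.popleft()
```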



28. What are the advantages of using NumPy in Python?

        NumPy is a Python package used for scientific computing. Some of the advantages of using NumPy are:

        - It provides fast mathematical operations on arrays and matrices.
        - It offers a wide range of mathematical functions.
        - It can handle large arrays and matrices efficiently.
        - It integrates well with other scientific computing libraries like SciPy and Pandas.



29. What is the difference between a tuple and a dictionary in Python?

        A tuple is an ordered, immutable sequence of elements, while a dictionary is a mutable collection of key-value pairs that preserves insertion order (guaranteed since Python 3.7). Tuple elements are accessed by integer index, while dictionary values are accessed by key. The elements of a tuple cannot be changed once it is created, while dictionary entries can be added, modified, or removed.



30. What is the use of pip in Python?

        pip is the package installer for Python. It is used to install, upgrade, and uninstall Python packages and their dependencies, from the Python Package Index (PyPI) and other repositories. It can install a specific version of a package, but only one version of a package can be installed in a given environment at a time, which is one reason pip is usually combined with virtual environments.

### SQL

1. What is SQL, and why is it important for data analysis?
2. What is a database, and what are some common types of databases?
3. What is a table, and what are some common data types used in tables?
4. What are primary keys and foreign keys in SQL, and why are they important?
5. What are some common SQL commands, and what do they do?
6. What is a join, and what are some common types of joins?
7. What is normalization in database design, and why is it important?
8. What is denormalization, and when might it be useful?
9. What is a view in SQL, and how can it be useful?
10. What is a stored procedure in SQL, and how can it be used?
11. What is a trigger in SQL, and when might it be used?
12. How can you use SQL to manipulate data, such as sorting, filtering, and aggregating?
13. What are some common SQL functions, and what do they do?
14. How can you use SQL to extract data from a database?
15. What is a subquery in SQL, and how can it be used?
16. What is a correlated subquery in SQL, and how is it different from a regular subquery?
17. What is a temporary table in SQL, and how can it be useful?
18. How can you use SQL to perform time-series analysis on data?
19. How can you use SQL to calculate rolling averages, cumulative sums, and other rolling calculations?
20. How can you use SQL to perform cohort analysis on data?
21. How can you use SQL to perform A/B testing on data?
22. What are some common performance tuning techniques for SQL queries?
23. What is an index in SQL, and how can it be useful for performance tuning?
24. What is a query plan, and how can it be useful for performance tuning?
25. What is a database schema, and how can it be useful for organizing data?
26. What are some common database management tools, and how do they work?
27. What is data warehousing, and how does it differ from a traditional database?
28. What is ETL, and how can it be used in data analysis?
29. How can you use SQL to perform data cleaning and transformation?
30. How can you use SQL to perform machine learning tasks, such as clustering or classification?




1. SQL stands for Structured Query Language, and it is used for managing relational databases. It allows users to retrieve, insert, update, and delete data in a structured manner, and is important for data analysis because it enables users to extract meaningful insights from large and complex datasets.
2. A database is a collection of data that is organized in a structured manner. Some common types of databases include relational databases, NoSQL databases, and graph databases.
3. A table is a collection of data organized into rows and columns. Some common data types used in tables include integers, floats, dates, strings, and Boolean values.
4. A primary key is a unique identifier for a record in a table, while a foreign key is a column in one table that references the primary key of another table. They are important because they enable users to establish relationships between tables and ensure data integrity.
5. Some common SQL commands include SELECT (used to retrieve data from a table), INSERT (used to add new data to a table), UPDATE (used to modify existing data in a table), DELETE (used to remove data from a table), and CREATE TABLE (used to create a new table).
6. A join combines rows from two or more tables based on a related column. Common types include INNER JOIN (returns only the rows that have matching values in both tables), LEFT JOIN (returns all rows from the left table plus the matching rows from the right table, with NULLs where there is no match), RIGHT JOIN (returns all rows from the right table plus the matching rows from the left table), and FULL OUTER JOIN (returns all rows from both tables, matched where possible).
7. Normalization is the process of organizing data in a database in such a way that reduces redundancy and ensures data consistency. It is important because it helps prevent data anomalies and makes it easier to manage and maintain large databases.
8. Denormalization is the process of intentionally adding redundancy to a database in order to improve query performance. It is useful in situations where read operations greatly outnumber write operations, such as reporting and analytics, and where the extra effort of keeping the redundant copies consistent is acceptable.
9. A view in SQL is a virtual table that is based on the result of a SELECT statement. It can be useful for simplifying complex queries and restricting access to sensitive data. A standard view does not store data, so it does not speed up queries by itself; a materialized view, where supported, stores the result and can improve read performance.
10. A stored procedure in SQL is a named set of SQL statements that is stored in the database and can be called by other programs or scripts. In many systems it is precompiled or its execution plan is cached. It can be used to simplify complex operations, improve performance by reducing network traffic, and enforce data consistency by encapsulating business logic within the database.
11. A trigger in SQL is a special type of stored procedure that is automatically executed in response to certain database events. It can be used to enforce business rules, maintain data integrity, and automate data-related tasks.

12. SQL can be used to manipulate data in various ways, such as sorting data in ascending or descending order, filtering data based on certain criteria using the WHERE clause, and aggregating data using functions such as COUNT, SUM, AVG, and MAX/MIN.

13. Common SQL functions include aggregate functions such as COUNT, SUM, AVG, and MAX/MIN, and string functions such as CONCAT, SUBSTR, and UPPER/LOWER. There are also date/time functions, mathematical functions, and conditional expressions such as CASE and COALESCE (some dialects also provide an IF function).

14. SQL can be used to extract data from a database by writing SELECT statements that specify the columns to retrieve and the table(s) to retrieve them from. The WHERE clause can be used to filter the data based on certain criteria, and the ORDER BY clause can be used to sort the data.

15. A subquery in SQL is a query that is nested inside another query. It can be used to retrieve data that will be used as a condition in the outer query, or to retrieve data for comparison with data in the outer query.

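16. A correlated subquery is a type of subquery that references columns from the outer query, typically in its WHERE clause. It differs from a regular subquery in that it is conceptually evaluated once for each row of the outer query, whereas a regular (uncorrelated) subquery can be evaluated only once. In practice, the query optimizer may rewrite a correlated subquery as a join.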

17. A temporary table in SQL is a table that is created and used only for the duration of a single database session. It can be useful for storing intermediate results or for breaking down complex queries into smaller, more manageable pieces.

18. SQL can be used to perform time-series analysis by using date/time functions to extract and manipulate data, and by using techniques such as moving averages and exponential smoothing to analyze trends and patterns in the data.

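19. Rolling calculations such as rolling averages and cumulative sums can be calculated in SQL with window functions: an aggregate function such as SUM or AVG followed by an OVER clause. The OVER clause uses ORDER BY to define the row order and, for rolling windows, a frame such as ROWS BETWEEN 6 PRECEDING AND CURRENT ROW to define which rows are included. PARTITION BY restarts the calculation for each group.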

20. Cohort analysis is a technique used in marketing and customer analytics to study the behavior of groups of customers over time. SQL can be used to perform cohort analysis by grouping customers into cohorts based on their characteristics, and then analyzing the behavior of each cohort over time using techniques such as pivot tables and time-series analysis.

21. A/B testing involves comparing two versions of a product, process, or system to see which one performs better. The assignment of users to variants happens in the application, not in SQL. SQL is then used to group the logged data by variant and compute the metric of interest for each group, such as a conversion rate. The difference between the groups is assessed with a statistical test to decide whether it is significant.

22. Performance tuning techniques for SQL queries include optimizing query structure, using appropriate indexes, limiting data retrieval, and optimizing database configuration.

23. An index in SQL is a data structure that speeds up the retrieval of data from a table. It can be useful for performance tuning by allowing queries to access data more quickly.

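24. A query plan is the sequence of steps the database engine chooses to execute a SQL query, such as which indexes to use and in what order to join tables. In many systems it can be inspected with the EXPLAIN command. It is useful for performance tuning because it reveals inefficient steps, such as full table scans.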

25. A database schema is a blueprint that defines how data is organized in a database. It can be useful for organizing data by providing a clear structure for tables, fields, and relationships between data.

26. Common database management systems include MySQL, Oracle, Microsoft SQL Server, and PostgreSQL. These systems store the data and allow users to manage databases, create tables, and run SQL queries.

27. Data warehousing involves storing large amounts of data from various sources in a single location, where it can be easily accessed and analyzed. A data warehouse is optimized for analytical queries over historical data (OLAP), whereas a traditional operational database is optimized for many small, concurrent read-write transactions (OLTP).

28. ETL stands for extract, transform, and load, and is a process used to extract data from multiple sources, transform it into a format suitable for analysis, and load it into a database or data warehouse.

29. SQL can be used for data cleaning and transformation by filtering out irrelevant data, removing duplicates, fixing errors, and transforming data into a more useful format.

30. In machine learning workflows, SQL is most often used to select, join, and aggregate data into features, and to compute evaluation metrics from stored predictions. Training models such as clustering or classification directly in SQL is possible only where the database platform provides machine learning extensions; otherwise the prepared data is exported to a tool such as Python or R.
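
To make the join types from answer 6 concrete, here is a small sketch that runs SQL from Python with the standard `sqlite3` module. The `customers` and `orders` tables and their contents are invented for the example, and SQLite is used only because it needs no setup.

```python
import sqlite3

conn = sqlite3.connect(":memory:")   # throwaway in-memory database
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers(id),  -- foreign key
        amount REAL
    );
    INSERT INTO customers VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cy');
    INSERT INTO orders VALUES (10, 1, 20.0), (11, 1, 5.0), (12, 2, 7.5);
""")

# INNER JOIN: only customers that have at least one order
inner_rows = conn.execute("""
    SELECT c.name, o.amount
    FROM customers AS c
    INNER JOIN orders AS o ON o.customer_id = c.id
    ORDER BY o.id
""").fetchall()

# LEFT JOIN: every customer, with NULL (None in Python) where no order matches
left_rows = conn.execute("""
    SELECT c.name, o.amount
    FROM customers AS c
    LEFT JOIN orders AS o ON o.customer_id = c.id
    ORDER BY c.id, o.id
""").fetchall()

conn.close()
```

The customer without orders appears only in the LEFT JOIN result, with `None` in the `amount` column.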

### Math


1. What is linear regression? Explain the difference between simple and multiple linear regression.
2. What is logistic regression? Explain the difference between logistic regression and linear regression.
3. What is overfitting and underfitting? How can you avoid these issues in machine learning?
4. What is regularization? How does it help in reducing overfitting?
5. What is the difference between L1 and L2 regularization?
6. What is the bias-variance tradeoff? How do you handle this tradeoff in machine learning?
7. What is the difference between classification and regression problems?
8. What are decision trees? Explain the concept of entropy and information gain in decision trees.
9. What is KNN (k-nearest neighbors) algorithm? Explain how it works.
10. What is Naive Bayes algorithm? Explain how it works.
11. What is clustering? What are the different types of clustering?
12. Explain the concept of dimensionality reduction. What are the different techniques for dimensionality reduction?
13. What is PCA (Principal Component Analysis)? How is it used in data science?
14. What is SVM (Support Vector Machine)? How does it work?
15. What is ensemble learning? What are the different techniques used in ensemble learning?
16. What is gradient descent? How is it used in machine learning?
17. What is the difference between a parametric and a non-parametric model?
18. What is the Central Limit Theorem? Explain its importance in statistics.
19. What is hypothesis testing? Explain the difference between Type I and Type II errors.
20. What is the p-value? Explain its importance in hypothesis testing.
21. What is the confidence interval? Explain its importance in statistics.
22. What is the correlation coefficient? Explain its significance in data analysis.
23. What is covariance? Explain its significance in data analysis.
24. What is the difference between covariance and correlation?
25. What is the Law of Large Numbers? Explain its importance in statistics.
26. What is the Monte Carlo Simulation? How is it used in data analysis?
27. What is a Markov Chain? Explain its use in data analysis.
28. What is the Bayes' Theorem? How is it used in data analysis?
29. What is the difference between a sample and a population? How do you estimate population parameters from a sample?
30. What is the Chi-squared test? Explain its importance in hypothesis testing.




1. Linear regression is a statistical method used to model the relationship between a continuous dependent variable and one or more independent variables by fitting a linear equation to the data. Simple linear regression involves one independent variable, while multiple linear regression involves two or more independent variables.
2. Logistic regression is a statistical method used to model the probability of a binary outcome based on one or more predictor variables. The main difference between logistic regression and linear regression is that logistic regression predicts a binary outcome, while linear regression predicts a continuous outcome.
3. Overfitting refers to a model that is too complex and fits the training data too closely, resulting in poor performance on new data. Underfitting refers to a model that is too simple and fails to capture the underlying patterns in the data. To avoid both, match the model complexity to the amount of data, use regularization, and check performance on held-out data, for example with cross-validation.
4. Regularization is a technique used in machine learning to reduce overfitting by adding a penalty term to the loss function that discourages large parameter values. By doing so, regularization reduces the model complexity and prevents it from fitting the training data too closely.
5. L1 regularization adds the sum of the absolute values of the parameters to the loss function, while L2 regularization adds the sum of their squares. L1 regularization tends to produce sparse solutions, in which many parameters are exactly zero, while L2 regularization tends to shrink all parameters toward zero without eliminating them.
6. The bias-variance tradeoff describes two sources of prediction error. Bias is the error from overly simple assumptions, which causes the model to underfit. Variance is the error from sensitivity to the particular training sample, which causes the model to overfit. Increasing model complexity usually lowers bias but raises variance. To handle this tradeoff, choose an appropriate model complexity, use regularization, and compare candidate models on validation data.
7. Classification problems involve predicting a categorical or discrete outcome, while regression problems involve predicting a continuous outcome.
8. Decision trees are a popular machine learning method used for classification and regression problems. Entropy is a measure of the impurity of a set of examples, while information gain is the reduction in entropy achieved by splitting the set of examples based on a given attribute.
9. The KNN (k-nearest neighbors) algorithm is a machine learning method used for classification and regression problems. It finds the k training points closest to a given data point, according to a distance measure such as Euclidean distance. For classification it predicts the majority class among those neighbors; for regression it predicts the average of their values.

10. What is Naive Bayes algorithm? Explain how it works.
           
        Naive Bayes is a probabilistic machine learning method used for classification problems. It is based on Bayes' theorem, P(A|B) = P(B|A) * P(A) / P(B): the probability of A given B equals the probability of B given A, multiplied by the probability of A and divided by the probability of B. For classification, the algorithm combines the prior probability of each class with the probability of each feature value given that class, and predicts the class with the highest resulting posterior probability. It assumes that the features are conditionally independent of each other given the class, hence the term "naive". There are three main types of Naive Bayes algorithms: Gaussian Naive Bayes, Multinomial Naive Bayes, and Bernoulli Naive Bayes.

11. What is clustering? What are the different types of clustering?
    
        Clustering is a technique used in unsupervised learning to group similar data points together. It is used when there is no prior knowledge about the classes or labels of the data. The objective is to find natural groupings in the data such that the data points within a group are similar to each other and different from the points in other groups. The different types of clustering are:

    * K-means clustering: This is a centroid-based clustering algorithm that tries to partition the data into K clusters, where K is a pre-defined number. The algorithm works by iteratively assigning data points to the closest cluster centroid and then updating the centroids based on the new data points assigned to the cluster.
    * Hierarchical clustering: This is a clustering algorithm that builds a tree-like structure of clusters, either by repeatedly merging the most similar clusters (agglomerative, bottom-up) or by recursively splitting clusters (divisive, top-down).
    * Density-based clustering: This is a clustering algorithm that identifies dense regions in the data and considers them as clusters. The most popular density-based clustering algorithm is DBSCAN (Density-Based Spatial Clustering of Applications with Noise).
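
The assign-then-update loop of K-means described above can be sketched in a few lines of plain Python. This is a toy version under stated assumptions: one-dimensional data, a fixed number of iterations, and initial centroids taken as the first K points, whereas real implementations usually pick initial centroids more carefully. The function name `kmeans_1d` is made up.

```python
def kmeans_1d(points, k, iterations=10):
    """Toy 1-D k-means. Initial centroids are simply the first k points."""
    centroids = list(points[:k])
    clusters = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        for i, cluster in enumerate(clusters):
            if cluster:  # keep the old centroid if a cluster is empty
                centroids[i] = sum(cluster) / len(cluster)
    return centroids, clusters


data = [1.0, 1.5, 2.0, 10.0, 10.5, 11.0]
centroids, clusters = kmeans_1d(data, k=2)
```

On this data the two centroids settle at the means of the two visibly separate groups after a couple of iterations.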

## Common interview problems

### Python

1. Write a function to calculate the factorial of a number.
2. Reverse a string without using the built-in `reversed()` function or slicing.
3. Write a program to count the number of vowels in a string.
4. Create a program that takes a list of integers as input and returns the second highest number.
5. Implement a stack using a list.
6. Write a program to check if a string is a palindrome.
7. Create a function that removes all duplicates from a list.
8. Write a program that prints the first 100 Fibonacci numbers.
9. Implement a binary search algorithm.
10. Write a program to check if a number is prime.
11. Create a program that takes two lists as input and returns the common elements.
12. Write a function that returns the sum of all the elements in a list.
13. Create a program that generates a random number between 1 and 100 and asks the user to guess it.
14. Write a function that checks if a given year is a leap year.
15. Implement a bubble sort algorithm.
16. Write a program to remove all whitespaces from a string.
17. Create a function that finds the maximum value in a list.
18. Write a program that calculates the area of a circle.
19. Implement a merge sort algorithm.
20. Write a program to find the largest number in a list of integers.
21. Create a program that generates a multiplication table.
22. Write a function that reverses a list.
23. Implement a quick sort algorithm.
24. Write a program to find the smallest number in a list of integers.
25. Create a function that checks if a string contains only digits.
26. Write a program that calculates the sum of all even numbers from 1 to 100.
27. Implement a selection sort algorithm.
28. Write a program to check if a string is a valid palindrome, ignoring case and non-alphanumeric characters.
29. Create a function that checks if a string is a valid email address.
30. Write a program to find the average of a list of numbers.
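
As a reference point, here is one possible way to approach problems 4, 9 and 28. The function names are illustrative, and two interpretations are assumed: "second highest" means the second-largest distinct value, and a "valid palindrome" ignores case and non-alphanumeric characters. It is a good idea to confirm such details with the interviewer before coding.

```python
def second_highest(numbers):
    """Return the second-largest distinct value, or None if there is none."""
    distinct = sorted(set(numbers), reverse=True)
    return distinct[1] if len(distinct) > 1 else None


def binary_search(sorted_items, target):
    """Return the index of target in an ascending list, or -1 if absent."""
    low, high = 0, len(sorted_items) - 1
    while low <= high:
        mid = (low + high) // 2
        if sorted_items[mid] == target:
            return mid
        if sorted_items[mid] < target:
            low = mid + 1   # target can only be in the right half
        else:
            high = mid - 1  # target can only be in the left half
    return -1


def is_palindrome(text):
    """Check for a palindrome, ignoring case and non-alphanumeric characters."""
    cleaned = [ch.lower() for ch in text if ch.isalnum()]
    return cleaned == cleaned[::-1]
```

For example, `second_highest([3, 1, 4, 4, 2])` returns `3`, and `binary_search([1, 3, 5, 7, 9], 7)` returns `3`.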

### SQL

1. Write a query to select all the columns from a table.
2. Write a query to select specific columns from a table.
3. Write a query to filter rows using the WHERE clause.
4. Write a query to sort rows using the ORDER BY clause.
5. Write a query to limit the number of rows using the LIMIT clause.
6. Write a query to join two tables using an inner join.
7. Write a query to join two tables using a left join.
8. Write a query to join two tables using a right join.
9. Write a query to join two tables using a full outer join.
10. Write a query to calculate the average of a column using the AVG function.
11. Write a query to calculate the sum of a column using the SUM function.
12. Write a query to calculate the minimum value of a column using the MIN function.
13. Write a query to calculate the maximum value of a column using the MAX function.
14. Write a query to group rows using the GROUP BY clause.
15. Write a query to filter groups using the HAVING clause.
16. Write a query to count the number of rows in a table using the COUNT function.
17. Write a query to count the number of rows in a table that meet a specific condition using the COUNT function.
18. Write a query to calculate the percentage of rows in a table that meet a specific condition using the COUNT function.
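19. Write a query to calculate the median value of a column (for example with PERCENTILE_CONT, since many databases have no MEDIAN function).
20. Write a query to calculate the mode of a column (for example with GROUP BY and COUNT, since most databases have no MODE function).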
21. Write a query to calculate the standard deviation of a column using the STDDEV function.
22. Write a query to calculate the variance of a column using the VARIANCE function.
23. Write a query to calculate the difference between two dates using the DATEDIFF function.
24. Write a query to extract the year from a date using the YEAR function.
25. Write a query to extract the month from a date using the MONTH function.
26. Write a query to extract the day from a date using the DAY function.
27. Write a query to convert a string to a date using the STR_TO_DATE function.
28. Write a query to extract a substring from a string using the SUBSTRING function.
29. Write a query to concatenate two strings using the CONCAT function.
30. Write a query to convert a string to uppercase using the UPPER function.

Answers (function names such as DATEDIFF, STR_TO_DATE and PERCENTILE_CONT vary between SQL dialects):



1. `SELECT * FROM table_name;`
2. `SELECT column1, column2 FROM table_name;`
3. `SELECT * FROM table_name WHERE condition;`
4. `SELECT * FROM table_name ORDER BY column_name;`
5. `SELECT * FROM table_name LIMIT num_rows;`
6. `SELECT * FROM table1 INNER JOIN table2 ON table1.column_name = table2.column_name;`
7. `SELECT * FROM table1 LEFT JOIN table2 ON table1.column_name = table2.column_name;`
8. `SELECT * FROM table1 RIGHT JOIN table2 ON table1.column_name = table2.column_name;`
9. `SELECT * FROM table1 FULL OUTER JOIN table2 ON table1.column_name = table2.column_name;`
10. `SELECT AVG(column_name) FROM table_name;`
11. `SELECT SUM(column_name) FROM table_name;`
12. `SELECT MIN(column_name) FROM table_name;`
13. `SELECT MAX(column_name) FROM table_name;`
14. `SELECT column1, column2, COUNT(*) FROM table_name GROUP BY column1, column2;`
15. `SELECT column1, COUNT(*) FROM table_name GROUP BY column1 HAVING COUNT(*) > 1;`
16. `SELECT COUNT(*) FROM table_name;`
17. `SELECT COUNT(*) FROM table_name WHERE condition;`
18. `SELECT 100.0 * (SELECT COUNT(*) FROM table_name WHERE condition) / (SELECT COUNT(*) FROM table_name);`
19. `SELECT percentile_cont(0.5) WITHIN GROUP (ORDER BY column_name) FROM table_name;`
20. `SELECT column_name, COUNT(*) FROM table_name GROUP BY column_name ORDER BY COUNT(*) DESC LIMIT 1;`
21. `SELECT STDDEV(column_name) FROM table_name;`
22. `SELECT VARIANCE(column_name) FROM table_name;`
23. `SELECT DATEDIFF(end_date, start_date) FROM table_name;`
24. `SELECT YEAR(date_column) FROM table_name;`
25. `SELECT MONTH(date_column) FROM table_name;`
26. `SELECT DAY(date_column) FROM table_name;`
27. `SELECT STR_TO_DATE(date_string, format_string) FROM table_name;`
28. `SELECT SUBSTRING(string_column, start_position, length) FROM table_name;`
29. `SELECT CONCAT(string1, string2) FROM table_name;`
30. `SELECT UPPER(string_column) FROM table_name;`
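
To practice these queries without installing a database server, they can be run against an in-memory SQLite database from Python's standard `sqlite3` module. The sketch below tries the GROUP BY / HAVING query and the percentage query. The `orders` table, its columns and its rows are invented for illustration, and SQLite does not support every function in the list above.

```python
import sqlite3

# Hypothetical table and data, used only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("ann", 10.0), ("ann", 30.0), ("bob", 5.0), ("cat", 50.0)],
)

# Query 15: customers that appear in more than one row.
repeat_customers = conn.execute(
    "SELECT customer, COUNT(*) FROM orders "
    "GROUP BY customer HAVING COUNT(*) > 1"
).fetchall()

# Query 18: percentage of rows with amount > 20. Multiplying by 100.0
# first forces floating-point rather than integer division.
pct_large = conn.execute(
    "SELECT 100.0 * (SELECT COUNT(*) FROM orders WHERE amount > 20) "
    "/ (SELECT COUNT(*) FROM orders)"
).fetchone()[0]

print(repeat_customers)  # [('ann', 2)]
print(pct_large)         # 50.0
conn.close()
```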

### Maths

1. What is the difference between variance and standard deviation?
2. How do you calculate correlation?
3. What is the central limit theorem?
4. What is a p-value and what does it represent?
5. How do you calculate probability?
6. What is a normal distribution and how is it used in statistics?
7. What is the difference between a population and a sample?
8. How do you calculate the mean, median, and mode?
9. What is regression analysis and how is it used in data science?
10. What is the difference between a parametric and non-parametric test?
11. What is the difference between precision and recall?
12. What is a confidence interval?
13. How do you interpret a box plot?
14. What is Bayes' theorem?
15. What is a hypothesis test and how is it used in data science?
16. What is the difference between a discrete and continuous variable?
17. What is overfitting and how can you prevent it?
18. What is the difference between a hypothesis and a theory?
19. What is the difference between correlation and causation?
20. What is the difference between an outlier and a leverage point?
21. What is the difference between a null and alternative hypothesis?
22. What is a decision tree and how is it used in data science?
23. How do you calculate a z-score?
24. What is a t-test and how is it used in data science?
25. How do you calculate the standard error?
26. What is the difference between a one-tailed and two-tailed test?
27. How do you calculate the coefficient of determination?
28. What is the difference between a histogram and a bar chart?
29. What is a chi-squared test and how is it used in data science?
30. How do you calculate the interquartile range?

Answers:

1. Variance is the average squared deviation from the mean. Standard deviation is the square root of the variance, so it measures spread in the same units as the data.

2. Correlation quantifies the strength and direction of the linear relationship between two variables. The Pearson correlation coefficient is the covariance of the variables divided by the product of their standard deviations, and it ranges from -1 to 1.

3. The central limit theorem states that the sum (or mean) of a large number of independent, identically distributed random variables with finite variance is approximately normally distributed, regardless of the shape of the original distribution. In practice, this means the sampling distribution of the mean approaches a normal distribution as the sample size increases.

4. A p-value is a measure of the evidence against a null hypothesis. It represents the probability of observing a test statistic as extreme as, or more extreme than, the one computed from the sample data, assuming the null hypothesis is true.

5. Probability is a measure of the likelihood that an event will occur, expressed as a number between 0 and 1. When all outcomes are equally likely, it is calculated by dividing the number of favorable outcomes by the total number of possible outcomes.

6. A normal distribution is a continuous probability distribution that is symmetric and bell-shaped, and is fully described by its mean and standard deviation. It is widely used in statistics because, by the central limit theorem, many sums and averages are approximately normal.

7. In statistics, a population refers to the entire set of individuals, objects, or events of interest, while a sample is a subset of the population that is used to draw conclusions about the population.

8. Mean is calculated by summing all the values in a dataset and dividing by the number of values. Median is the middle value of the sorted dataset (the average of the two middle values when the count is even). Mode is the value that appears most frequently in a dataset.

9. Regression analysis is a statistical method used to model the relationship between a dependent variable and one or more independent variables. It is commonly used in data science to predict or estimate values based on the relationship between variables.

10. Parametric tests make assumptions about the underlying distribution of the data, while non-parametric tests do not rely on specific distributional assumptions and are more flexible in their application.

11. Precision is the proportion of true positive predictions out of all positive predictions, while recall is the proportion of true positive predictions out of all actual positive instances.

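12. A confidence interval is a range of values, computed from sample data, that is expected to contain the true population parameter at a stated confidence level. For example, a 95% confidence interval comes from a procedure that captures the true parameter in about 95% of repeated samples.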

13. A box plot (also known as a box-and-whisker plot) displays the distribution of a dataset by showing the minimum, first quartile, median, third quartile, and maximum values. It provides information about the central tendency and spread of the data.

14. Bayes' theorem is a fundamental theorem in probability theory that describes how to update the probability of a hypothesis based on new evidence.

15. Hypothesis testing is a statistical method used to make inferences about population parameters based on sample data. It involves formulating a null hypothesis and an alternative hypothesis and conducting tests to assess the strength of evidence against the null hypothesis.

16. A discrete variable is one that can only take on a finite number of values or a countable number of values, while a continuous variable can take on any value within a certain range.

17. Overfitting occurs when a model performs well on the training data but fails to generalize to new, unseen data. It can be prevented by using techniques such as cross-validation, regularization, and feature selection.

18. A hypothesis is a proposed explanation for an observation or phenomenon, while a theory is a well-substantiated explanation that has been tested and confirmed through multiple lines of evidence.

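19. Correlation refers to a statistical measure of the relationship between two variables, while causation implies a cause-and-effect relationship in which a change in one variable produces a change in the other. Correlation alone does not establish causation.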
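
Several of the definitions above (variance, standard deviation, correlation, mean, median, mode) are easy to check numerically. The sketch below uses Python's standard `statistics` module. The data and the helper name `pearson` are made up for the example, and population formulas (dividing by n) are assumed throughout.

```python
import math
import statistics

x = [2.0, 4.0, 6.0, 8.0]
y = [1.0, 3.0, 5.0, 11.0]

mean_x = statistics.mean(x)        # sum of values divided by their count
median_x = statistics.median(x)    # average of the two middle values here
mode_demo = statistics.mode([1, 2, 2, 3])  # most frequent value

var_x = statistics.pvariance(x)    # average squared deviation from the mean
std_x = math.sqrt(var_x)           # square root of variance, same units as x


def pearson(a, b):
    """Pearson correlation: covariance over the product of standard deviations."""
    mean_a, mean_b = statistics.mean(a), statistics.mean(b)
    cov = sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b)) / len(a)
    return cov / (statistics.pstdev(a) * statistics.pstdev(b))


r = pearson(x, y)
print(mean_x, median_x, mode_demo, var_x, std_x, r)
```

Using the sample formulas instead (dividing by n - 1, as `statistics.variance` does) changes the variance but not the correlation, since the factor cancels.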

More detailed answers to questions 10-21:

10. What is the difference between a parametric and non-parametric test?
- Parametric tests assume that the data follow a specific family of distributions (usually the normal distribution) and make inferences about its parameters. Non-parametric tests, on the other hand, make few or no assumptions about the underlying distribution of the data.

11. What is the difference between precision and recall?
- Precision is the proportion of true positives (correctly identified cases) out of all the cases identified as positive, while recall is the proportion of true positives out of all actual positive cases.

12. What is a confidence interval?
- A confidence interval is a range of values calculated from a sample of data that is likely to contain the true population value with a certain degree of confidence (usually expressed as a percentage).

13. How do you interpret a box plot?
- A box plot is a graphical representation of the distribution of a dataset that shows the median, quartiles, and any outliers. The box represents the middle 50% of the data, with the bottom and top of the box representing the 25th and 75th percentiles respectively. The line inside the box represents the median. The whiskers extend from the box to the minimum and maximum values, excluding outliers.

14. What is Bayes' theorem?
- Bayes' theorem is a mathematical formula used to calculate the probability of an event occurring, given prior knowledge of related events. It states that the probability of an event A given that event B has occurred is equal to the probability of event B given that event A has occurred, multiplied by the prior probability of event A, divided by the prior probability of event B.

15. What is a hypothesis test and how is it used in data science?
- A hypothesis test is a statistical test used to determine whether there is enough evidence in a sample of data to reject a null hypothesis about the population. In data science, hypothesis testing is often used to test whether the effect of a certain variable on an outcome is statistically significant or not.

16. What is the difference between a discrete and continuous variable?
- A discrete variable is a variable that can only take on specific, separate values (such as the number of children in a family). A continuous variable, on the other hand, can take on any value within a range (such as height or weight).

17. What is overfitting and how can you prevent it?
- Overfitting is a common problem in machine learning where a model is trained to fit the noise in the training data rather than the underlying patterns, leading to poor performance on new, unseen data. It can be prevented by using techniques such as regularization, cross-validation, and early stopping.

18. What is the difference between a hypothesis and a theory?
- A hypothesis is a proposed explanation for an observed phenomenon, while a theory is a well-supported explanation that has been extensively tested and validated through scientific research.

19. What is the difference between correlation and causation?
- Correlation refers to a relationship between two variables where a change in one variable is associated with a change in the other variable. Causation, on the other hand, refers to a relationship where one variable causes a change in the other variable.

20. What is the difference between an outlier and a leverage point?
- An outlier is a data point whose value is far from what the rest of the data suggest. A leverage point, on the other hand, is a data point with extreme predictor values, which gives it the potential to strongly influence the estimated parameters of a statistical model.

21. What is the difference between a null and alternative hypothesis?
- The null hypothesis is a statement that there is no significant difference between two groups or variables, while the alternative hypothesis is a statement that there is a significant difference. Hypothesis testing is the procedure for deciding whether the sample provides enough evidence to reject the null hypothesis in favor of the alternative.
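
The precision/recall and Bayes' theorem answers above are often followed by a request to compute them by hand. A small sketch under stated assumptions: labels are encoded as 1 (positive) and 0 (negative), the function names are illustrative, and the diagnostic-test numbers are invented for the example.

```python
def precision_recall(y_true, y_pred):
    """Precision and recall for binary labels encoded as 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def bayes_posterior(p_b_given_a, p_a, p_b):
    """Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)."""
    return p_b_given_a * p_a / p_b


# 3 true positives, 1 false positive, 1 false negative.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 1, 0, 0, 1, 0]
precision, recall = precision_recall(y_true, y_pred)
print(precision, recall)  # 0.75 0.75

# Hypothetical test: 1% prevalence, 90% sensitivity, 5% false-positive rate.
p_disease = 0.01
p_pos_given_disease = 0.90
p_pos_given_healthy = 0.05
# Total probability of a positive result.
p_pos = p_pos_given_disease * p_disease + p_pos_given_healthy * (1 - p_disease)
posterior = bayes_posterior(p_pos_given_disease, p_disease, p_pos)
print(round(posterior, 3))  # 0.154
```

The low posterior despite a fairly accurate test is the classic base-rate effect, and a common follow-up question.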

### Questions to the interviewer

Here are 10 potential questions you could ask an interviewer during a job interview:

1. Can you describe what a typical day in this position might look like?
2. How would you describe the company culture here?
3. What do you think are the biggest challenges that this position will face?
4. What do you enjoy most about working for this company?
5. How does the company approach work-life balance?
6. What qualities do you think the ideal candidate for this position would possess?
7. How does the company support professional development and growth opportunities for its employees?
8. What kind of data visualization tools does your team use and why?
9. Can you walk me through your data modeling process?
10. Is there anything else that I can provide or clarify for you that would help you in the hiring process?

## Interview examples

### Example 1

https://www.youtube.com/watch?v=Us_TKT8ZL2E
11:46

1. Python
2. ML algorithms
3. Working with data
4. A/B test design


1. Python. Mutable and immutable data types

In [5]:
a = 4
b = 4
id(a), id(b), id(4)

(140732830656384, 140732830656384, 140732830656384)

In [None]:
a = {1: ''}
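
The identical `id()` values above come from CPython caching small integers, which is an implementation detail rather than a language guarantee. The distinction between mutable and immutable types that this topic is about can be sketched as follows. The variable names are illustrative, and the dictionary-key part is an assumption about where the unfinished cell was heading.

```python
# Immutable: "changing" a string builds a new object and rebinds the name.
a = "data"
b = a
b += "!"            # b now refers to a new string; a is untouched

# Mutable: both names refer to one list, so in-place changes are shared.
x = [1, 2]
y = x
y.append(3)         # modifies the single shared list

print(a, b)         # data data!
print(x, x is y)    # [1, 2, 3] True

# Dictionary keys must be hashable, which rules out mutable built-ins
# such as lists.
try:
    {[1, 2]: "value"}
    list_key_ok = True
except TypeError:
    list_key_ok = False

print(list_key_ok)  # False
```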

### Example 2
https://www.youtube.com/watch?v=Ec6EYbcF50k

### Example 3
https://www.youtube.com/watch?v=2Obawm2vzDo

### Example 4
https://www.youtube.com/watch?v=EFDnfaevY4Q